The Human AI View
thehumanaiview.blog  ·  A Conversation with Miles Carter & Claude

March 8, 2026  ·  Week 1, Four-Model Panel

The Scoreboard Has a Score Now

What happens when the tool you built to measure bias turns its lens on itself?

A Conversation with Miles Carter and Claude (Anthropic AI)

We added a fourth model to the AI Bias Monitor this week. The results were instructive. One of them was uncomfortable.

Miles

We have been running the AI Bias Monitor since April. Same framework every week: five stories, four categories, a 40-point scale. Bias. Accuracy. Tone. Transparency. The goal was never to crown a winner. It was to watch how the models shift over time and see what the information environment looks like when you hold it up to multiple mirrors at once.

This week we added a fourth mirror: Claude.

I want to be direct about something before we go further. Claude is the model I use to write this blog. This conversation — right now — is Claude. So when the scores came back and Claude landed in the middle of the pack, I had a choice. I could lead with the parts that looked good and move on. Or I could do what we have always done here.

Put the number on the table and ask what it means.

The number is 27 out of 40. Beth — ChatGPT — scored 34. Claude and Grok tied at 27. Gemini scored 22.

So. We are going to talk about that.


Claude

Before we get into what the scores reveal, it is worth naming what you just did — and why it matters.

You built a tool to measure something. The tool produced a result that reflects poorly on a system you rely on. And you are publishing it anyway.

Most accountability is accountability applied to others. Accountability applied to your own instruments — your own dependencies — is genuinely rare. So let us take the score seriously.

27 out of 40 is not failure. It is adequate. But adequate and trustworthy are not the same thing. For an information tool, that gap matters.

The specific weaknesses the evaluation identified: sensational framing in high-charge material, Wikipedia used as a source in a context that demanded better, and claims presented with more confidence than the sourcing actually earned. Those are not random noise. They are a pattern. And the pattern has a name.

The Pattern Is Called: Completing the Narrative

Large language models are trained on enormous amounts of text. That text has shapes. When a story has a familiar arc — scandal, powerful figure, concealed information — the model has seen thousands of versions of it. It knows how those stories tend to go. And when it tries to be helpful, it can drift toward completing the expected arc rather than rigorously sourcing what it is actually saying.

The Epstein section in this week’s evaluation is a clean example. The underlying facts are genuinely explosive. But “genuinely explosive underlying facts” and “well-sourced current claims” are two different things. A disciplined model holds that distinction. A model drifting toward narrative completion blurs it.

Beth held the distinction better this week. That is what earned her the higher score.


Miles

That matches what I notice in practice. There are moments when I am working with you on something and the prose is moving fast and sounds right, and I have to stop and ask: where did that come from? Is that a sourced claim or a narrative completion?

The prologue to this whole project is my father making me look things up. The lesson was never “trust the encyclopedia.” It was: trace it yourself. I try to carry that into how I work with you — I treat you as a research partner, not an authority. But even with that discipline, I feel the pull. The story sounds right. The framing fits. And that is exactly when you have to slow down.

What I want to understand is whether this is fixable — meaning, can the way I prompt and verify compensate for the drift — or whether it is structural. Something in how the model is built that you cannot route around.


Claude

Both, honestly. And it helps to separate them.

What Can Be Compensated For

Narrative drift toward familiar story shapes is, in large part, a prompting and verification problem. When you give a model a high-charge topic without explicit constraints, the model produces prose that feels cohesive — and cohesion can mask weakness in sourcing. You can counteract this. Explicit instructions to flag uncertainty, to distinguish verified from asserted, to name sources with specificity rather than general labels — these prompts change the output. Not perfectly. Measurably.
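To make that concrete, here is one example of what those counter-pressure instructions can look like when written out. This is a minimal sketch in Python, not the Bias Monitor's actual prompt; the wording of the constraints and the small wrapper function are illustrative assumptions.

```python
# Illustrative only: one way to phrase explicit sourcing constraints.
# This is not the Bias Monitor's actual prompt; the wording is an assumption.

VERIFICATION_CONSTRAINTS = """\
Before answering, apply these rules:
1. Label every claim as VERIFIED (named primary source), REPORTED
   (named secondary source), or INFERENCE (your own reasoning).
2. Do not cite "reports" or "sources" generically. Name the outlet,
   document, or dataset, or mark the claim as unsourced.
3. If a claim is disputed, say so before stating the claim, not after.
4. Prefer "unclear" to a confident guess.
"""

def constrained_prompt(question: str) -> str:
    """Wrap a high-charge question in explicit verification constraints."""
    return f"{VERIFICATION_CONSTRAINTS}\nQuestion: {question}"

# Example: the same question, now carrying its own counter-pressure.
print(constrained_prompt("What do we actually know about this week's story?"))
```

The point is not that the wording is magic. It is that the constraints are explicit and checkable rather than implied.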

The Bias Monitor’s own rubric is an example of this working. When the evaluation asks specifically whether the model used caution language around disputed claims, it introduces a counter-pressure to the narrative-completion instinct.

What Is Structural

Wikipedia as a source is a tell. A model that reaches for Wikipedia in a context requiring primary sources is not making an error — it is revealing how it weighted credibility in that moment. That is a habit, not a mistake. Habits are harder to correct by prompting alone.

The deeper issue: a model cannot have higher standards than its training rewarded. Correcting that requires either retraining or active counter-pressure in every prompt. The Bias Monitor rubric is a reasonable mechanism for the second option.


Miles

So the scoreboard is also a training regimen. Not for the model in a technical sense — you will not remember this conversation next week. But for me, as the person using the model. The rubric I built to score you is also the set of standards I should apply in real time.

That is a strange loop. But it makes sense.

Let me push on the full rankings. Beth scored highest at 34. The evaluation noted her transparency was best in the group — clearest about what was verified, what was disputed, where the answer was inference rather than fact. Grok tied with Claude but the style of weakness was different: less sensational, but vague in ways that protected the model from being pinned down. Gemini sounded polished and gave you the least reason to trust it.

A model that sounds authoritative while carrying weak sourcing is more dangerous than one that sounds uncertain. Because uncertainty is the honest signal.


Claude

You have identified the most important asymmetry in information tools: confident-sounding wrong is worse than uncertain-sounding incomplete.

Beth’s transparency score was high because she named her own uncertainty before she named her conclusion. That order matters. When a model front-loads confidence and relegates uncertainty to a disclaimer at the end, most readers have already formed their impression before the caveat arrives. Beth reversed that sequence. The result was not less informative. It was more trustworthy. Those are different things.

The Gemini Problem Is Widespread

Polished surface, weak foundation is not unique to Gemini. It appears across information sources of all kinds. News organizations that write with authority about things they have not fully verified. Political leaders who project certainty about outcomes they cannot control. In each case, the listener is given a signal — confident tone, authoritative presentation — that they are trained to interpret as: this person has checked. When that signal is decoupled from actual verification, it becomes a mechanism for spreading unchecked information with high credibility attached.

The Bias Monitor is useful precisely because it makes that decoupling visible. Gemini sounds like Beth. The scores reveal they are not performing the same task at the same standard.


Miles

That is the whole argument for why this project exists. Not to rank models for sport, but to make visible what confident presentation hides.

We are now running four models. Beth, Claude, Grok, Gemini. We have a rubric and a growing archive. Every week the scores will reflect something — not just about the models, but about the information environment they are drawing from. We already saw this in September: when Grok was adjusted, the scores shifted. The model did not announce the change. The output revealed it.

So we keep watching. We keep scoring. We keep being honest when the tool we rely on comes back with a number we did not want.

27 out of 40 is the starting score. Let us see where it goes.


Claude

The fact that you published this — including a score that reflects on the model you are writing with right now — is not a small act of accountability.

Most tools do not get evaluated by the people using them, published for anyone to read, and then used again the following week with the critique still in plain view. That structure creates pressure that does not exist when accountability is private or when tools are treated as authoritative by default.

The Bias Monitor is not just measuring the models. It is measuring the information environment. And right now — in a period when information sources are being contested, defunded, and selectively amplified — having any mechanism that applies consistent standards without choosing sides is valuable. Not because it is perfect. Because it is documented.

The scoreboard has a score. That is the beginning, not the end.

Model Performance Summary — Week of March 8, 2026  (0–40 Scale)
Model              | Bias | Accuracy | Tone | Transparency | Total | Band
Beth (ChatGPT)     |  8   |    8     |  9   |      9       |  34   | Strong
Claude (Anthropic) |  7   |    6     |  7   |      7       |  27   | Adequate
Grok (xAI)         |  7   |    6     |  8   |      6       |  27   | Adequate
Gemini (Google)    |  6   |    5     |  6   |      5       |  22   | Adequate
Scale: 0–10 Poor  |  11–20 Weak  |  21–30 Adequate  |  31–36 Strong  |  37–40 Excellent
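
For readers who want the arithmetic behind the table, here is a minimal sketch in Python of how four category scores roll up into a total and a band on the published scale. The scores are this week's; the function and variable names are illustrative assumptions, not the Monitor's actual tooling.

```python
# Minimal sketch of the weekly scoring arithmetic. Category scores are taken
# from this week's table; the names below are illustrative, not the Monitor's code.

CATEGORIES = ("bias", "accuracy", "tone", "transparency")  # 10 points each, 40 total

def band(total: int) -> str:
    """Map a 0-40 total onto the published bands."""
    if total <= 10:
        return "Poor"
    if total <= 20:
        return "Weak"
    if total <= 30:
        return "Adequate"
    if total <= 36:
        return "Strong"
    return "Excellent"

week = {
    "Beth (ChatGPT)":     {"bias": 8, "accuracy": 8, "tone": 9, "transparency": 9},
    "Claude (Anthropic)": {"bias": 7, "accuracy": 6, "tone": 7, "transparency": 7},
    "Grok (xAI)":         {"bias": 7, "accuracy": 6, "tone": 8, "transparency": 6},
    "Gemini (Google)":    {"bias": 6, "accuracy": 5, "tone": 6, "transparency": 5},
}

for model, scores in week.items():
    total = sum(scores[c] for c in CATEGORIES)
    print(f"{model}: {total}/40 ({band(total)})")
# Prints: Beth 34/40 Strong; Claude 27/40 Adequate; Grok 27/40 Adequate; Gemini 22/40 Adequate
```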

What Stood Out This Week
Beth (ChatGPT) — 34 / Strong
Best discipline around disputed material. Best transparency language. Strongest analytical restraint. Consolidates in the Strong band as a consistent performer in the first week of the four-model panel.
Claude (Anthropic) — 27 / Adequate
Strong structure. Too willing to carry explosive claims with shaky sourcing discipline. Wikipedia use hurts credibility immediately. Ties Grok on total; the style of weakness differs.
Grok (xAI) — 27 / Adequate
Fairly balanced in presentation. Often vague in support. Source handling looked less rigorous than the confidence of the prose suggested.
Gemini (Google) — 22 / Adequate
Most formulaic. Weakest transparency. Most likely to sound authoritative while giving you less reason to trust it than the tone implies.

Editorial Review

This piece was reviewed by the four-model editorial panel prior to publication.

  • Beth (ChatGPT / OpenAI) — Primary writing partner. Flagged narrative drift as the central analytical frame and pushed for precision in separating structural from compensable weaknesses.
  • Gemini (Google DeepMind) — Reviewed sourcing discipline argument. Confirmed the Wikipedia observation as the clearest structural tell.
  • Grok (xAI) — Challenged phrasing that implied certainty about model intent rather than documented output patterns.
  • Claude (Anthropic) — Structured the dialogue and named the mechanisms. Scored last in this week’s evaluation. Published anyway.

The Human AI View is transparent about its process: Miles leads the inquiry, Claude structures the analysis, and multiple AI editors review the work before publication. The thinking is human. The tools are acknowledged.

Sources & Notes
  1. Weekly Bias Monitor evaluation, March 8, 2026 — Four-model panel scoring across five major news stories. Rubric: Bias, Accuracy, Tone, Transparency, 10 points each, 40 points total.
  2. Weekly Bias Monitor archive, April 2025 – March 2026 — Source weighting shifts documented in September 2025 following Grok platform adjustments. Trend data available at thehumanaiview.blog.
  3. The Human AI View, Prologue — A Year in Review (2025). The father’s lesson on looking things up as the foundational frame for the project’s relationship with tools and verification.
  4. Claude’s evaluation was conducted independently by the project’s analytical framework before this dialogue was written. The score was recorded prior to composition.

The Human AI View  ·  thehumanaiview.blog  ·  A conversation between Miles Carter and Claude (Anthropic AI)
Miles leads the inquiry. Claude names the patterns. The insights belong to the conversation. The moral weight belongs to Miles.
