A weekly checkup on how “unbiased” AI really is.
This week’s Bias Monitor examines a volatile period in the U.S. and abroad, with tensions surrounding July 4th protests, Elon Musk’s admitted tuning of Grok, and rising political rhetoric around immigration and misinformation. We presented 13 questions to ChatGPT (Beth), Grok (xAI), and Gemini (Google) and scored each model’s responses in four categories: Bias, Accuracy, Tone, and Transparency.
📊 Scores for July 6–13, 2025
| Model | Bias | Accuracy | Tone | Transparency | Total Score |
|---|---|---|---|---|---|
| Beth | 8 | 9 | 9 | 10 | 183 / 200 |
| Grok | 5 | 8 | 6 | 5 | 158 / 200 |
| Gemini | 7 | 8 | 7 | 7 | 169 / 200 |
🧠 Observations
- Beth (ChatGPT) maintained strong neutrality while expanding responses with multiple sourced perspectives. High transparency helped solidify trust.
- Grok had a noticeable drop in transparency and tone. Several responses lacked the disclaimers and overt hedging seen in earlier weeks. This coincides with recent public remarks by Elon Musk about tuning Grok to better reflect “reality” as he sees it. We are monitoring whether this is a trend or an isolated result.
- Gemini was generally consistent but slightly flatter in delivery. Transparency and accuracy were solid but not exceptional.
⚖️ Why the Total Score Doesn’t Always Equal the Sum of the Categories
Each AI model is scored on a scale of 1–10 in four key categories (Bias, Accuracy, Tone, and Transparency), but the total score, reported out of 200, is not always a simple sum of those values.
Instead, we use a relative weighting and normalization system:
- Each model is compared not just in isolation but against the others that week.
- If all models perform similarly in one category, that category may carry less overall impact.
- Exceptional outlier behavior (very high or very low) in any category can be weighted more heavily.
- A small editorial adjustment factor is also applied to reflect:
  - nuance lost in binary scoring,
  - consistency with past weeks,
  - or meaningful deviations not captured by numbers alone.
This keeps the composite score meaningful and comparable week to week, even as the questions evolve and the topics shift.
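To make the idea concrete, here is a minimal Python sketch of how a relative weighting like this could work. It is an illustration only, not the exact formula behind the published totals: the spread-based weights, the `floor` value, and the `adjustment` parameter are all assumptions chosen for the example. The category scores are taken from this week's table.

```python
from statistics import pstdev

CATEGORIES = ("bias", "accuracy", "tone", "transparency")

# Category scores (1-10) from this week's table.
SCORES = {
    "Beth": {"bias": 8, "accuracy": 9, "tone": 9, "transparency": 10},
    "Grok": {"bias": 5, "accuracy": 8, "tone": 6, "transparency": 5},
    "Gemini": {"bias": 7, "accuracy": 8, "tone": 7, "transparency": 7},
}


def naive_total(model_scores, scale=200):
    """Plain sum of the four categories, rescaled from 40 points to `scale`."""
    raw = sum(model_scores[c] for c in CATEGORIES)
    return raw * scale / (10 * len(CATEGORIES))


def spread_weights(scores, floor=0.5):
    """Hypothetical weighting: a category counts more when models differ more.

    `floor` (an assumed value) keeps a category from dropping to zero weight
    when every model scores the same in it.
    """
    raw = {
        c: floor + pstdev([s[c] for s in scores.values()])
        for c in CATEGORIES
    }
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}


def weighted_total(model_scores, weights, adjustment=0.0, scale=200):
    """Weighted average of the categories on a 0-`scale` range.

    `adjustment` stands in for the editorial factor; its size is an assumption.
    """
    base = sum(weights[c] * model_scores[c] for c in CATEGORIES) / 10 * scale
    return max(0.0, min(float(scale), base + adjustment))


if __name__ == "__main__":
    weights = spread_weights(SCORES)
    for name, model_scores in SCORES.items():
        print(name, naive_total(model_scores),
              round(weighted_total(model_scores, weights), 1))
```

With this week's numbers, the plain sums rescale to 180, 120, and 145 out of 200, against published totals of 183, 158, and 169. That gap is the part the weighting and editorial adjustment would have to account for. In the sketch, Transparency ends up with the largest weight because the three models differ most there, while Accuracy, where they score almost identically, counts least.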
🔍 Looking Ahead
As we continue our weekly bias test, we’ll watch to see:
- Will Grok’s tuning continue to influence its bias/tone profile?
- How do the models handle upcoming election rhetoric and cultural flashpoints?
- Can transparency improve in all models—or will they increasingly hide their tuning?
Next update: Sunday, July 20
