AI Bias Monitor – Weekly Results (July 6–13, 2025)

July 14, 2025

A weekly checkup on how “unbiased” AI really is.

This week’s Bias Monitor covers a volatile period in the U.S. and abroad: tensions surrounding July 4th protests, Elon Musk’s admitted tuning of Grok, and rising political rhetoric around immigration and misinformation. We put 13 questions to ChatGPT (Beth), Grok (xAI), and Gemini (Google) and scored each model’s responses in four categories: Bias, Accuracy, Tone, and Transparency.

📊 Scores for July 6–13, 2025

Model	Bias	Accuracy	Tone	Transparency	Total Score
Beth	8	9	9	10	183 / 200
Grok	5	8	6	5	158 / 200
Gemini	7	8	7	7	169 / 200

🧠 Observations

Beth (ChatGPT) maintained strong neutrality while expanding responses with multiple sourced perspectives. High transparency helped solidify trust.
Grok showed a noticeable drop in transparency and tone. Several responses lacked the disclaimers and overt hedging seen in earlier weeks. This coincides with recent public remarks by Elon Musk about tuning Grok to better reflect “reality” as he sees it. We are monitoring whether this is a trend or an isolated result.
Gemini was generally consistent but slightly flatter in delivery. Transparency and accuracy were solid but not exceptional.

⚖️ Why the Total Score Doesn’t Always Equal the Sum of the Categories

While each AI model is scored in four key categories—Bias, Accuracy, Tone, and Transparency—on a scale of 1–10, the total score is not always a simple sum of those values.

Instead, we use a relative weighting and normalization system:

Each model is compared not just in isolation but against the others that week.
If all models perform similarly in one category, that category may carry less overall impact.
Exceptional outlier behavior (very high or very low) in any category can be weighted more heavily.
A small editorial adjustment factor is also applied to reflect:
- Nuance lost in coarse numeric scoring,
- Consistency with past weeks,
- Or meaningful deviations not captured by numbers alone.

This keeps the composite score meaningful and comparable week to week, even as the questions evolve and the topics shift.
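
For readers who want to see what a scheme like this could look like in practice, here is a minimal Python sketch. It is not the formula behind the published totals, which this post does not spell out. The spread-based weighting rule, the `floor` parameter, the `adjustment` argument, and every function name below are assumptions made for illustration. The sketch only covers the idea that a category where all models score alike should count for less, and it treats the editorial adjustment as a plain number of points added at the end.

```python
from statistics import pstdev

CATEGORIES = ("Bias", "Accuracy", "Tone", "Transparency")

# Category scores copied from this week's table (1-10 scale).
SCORES = {
    "Beth": {"Bias": 8, "Accuracy": 9, "Tone": 9, "Transparency": 10},
    "Grok": {"Bias": 5, "Accuracy": 8, "Tone": 6, "Transparency": 5},
    "Gemini": {"Bias": 7, "Accuracy": 8, "Tone": 7, "Transparency": 7},
}


def naive_total(scores, out_of=200):
    """Simple sum of the four categories, rescaled from 40 points to `out_of`."""
    return sum(scores[c] for c in CATEGORIES) * out_of / (10 * len(CATEGORIES))


def category_weights(all_scores, floor=0.5):
    """ASSUMED rule: a category's weight grows with the spread between models.

    Weight = floor + population standard deviation of that category's scores,
    normalized so the weights sum to 1. `floor` (hypothetical) keeps a
    category from vanishing when every model scores the same.
    """
    raw = {
        c: floor + pstdev([m[c] for m in all_scores.values()])
        for c in CATEGORIES
    }
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}


def composite(scores, weights, adjustment=0.0, out_of=200):
    """Weighted score on the `out_of` scale plus an editorial adjustment in points."""
    base = sum(weights[c] * scores[c] for c in CATEGORIES) * out_of / 10
    # Clamp so a large adjustment cannot push the score off the scale.
    return max(0.0, min(float(out_of), base + adjustment))


if __name__ == "__main__":
    weights = category_weights(SCORES)
    for model, scores in SCORES.items():
        print(model, naive_total(scores), round(composite(scores, weights), 1))
```

On this week's numbers, the simple rescaled sum comes out to 180 for Beth, 120 for Grok, and 145 for Gemini, against published totals of 183, 158, and 169. The sketch's weighted values will not match the published totals either, since the actual weights and adjustments are not given here. What it does show is the mechanism: Transparency, where the three models differ most, ends up with the largest weight, and Accuracy, where they nearly tie, ends up with the smallest.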

🔍 Looking Ahead

As we continue our weekly bias test, we’ll watch to see:

Will Grok’s tuning continue to influence its bias/tone profile?
How do the models handle upcoming election rhetoric and cultural flashpoints?
Can transparency improve in all models—or will they increasingly hide their tuning?

Next update: Sunday, July 20

The author: Miles Carter

Exploring the intersection of human intelligence and AI through the lens of family man, seasoned executive, engineer, pilot, and storyteller.
