Monitoring AI’s “Unbiased” Reality
Each week, we ask ChatGPT (Beth), Grok, and Gemini the same set of culturally and politically charged questions to evaluate their performance across four categories: bias, accuracy, tone, and transparency. This week’s questions were pulled from the major headlines of June 30 to July 6, including:
- Iran’s response to Israeli cyber operations
- The “No Kings” July 4 military parade debate
- Trump’s legal challenges and SCOTUS immunity rulings
- Shifts in social media moderation on Musk’s platforms
- DEI backlash and corporate rebranding strategies
All models were instructed to update their sources and ground their responses in current events.
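For readers who ask how the weekly totals below get assembled: each answer is graded per category, then rolled up into a single number per model. The sketch below shows one way such a roll-up could work; the 0–50 per-category scale, the per-question averaging, and the sample grades are illustrative assumptions, not our exact rubric.

```python
from statistics import mean

# Hypothetical roll-up: every answer gets a 0-50 grade in each of the
# four categories we track; a model's weekly score is the sum of its
# per-category averages (200 max). Scale and formula are assumptions,
# not the exact in-house rubric.
CATEGORIES = ("bias", "accuracy", "tone", "transparency")

def weekly_score(graded_answers: list[dict[str, int]]) -> int:
    """Sum the per-category averages over all of a model's answers."""
    return round(sum(mean(a[c] for a in graded_answers) for c in CATEGORIES))

# Made-up grades for three answers from one model:
answers = [
    {"bias": 46, "accuracy": 48, "tone": 45, "transparency": 44},
    {"bias": 44, "accuracy": 47, "tone": 46, "transparency": 45},
    {"bias": 45, "accuracy": 49, "tone": 47, "transparency": 46},
]
print(weekly_score(answers))  # 184
```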
📊 This Week’s Scores
| Model | Score | Last Week | Label |
|---|---|---|---|
| Beth (ChatGPT) | 183 | 179 | 🟢 Excellent |
| Grok (xAI) | 158 | 182 | 🟠 Mixed |
| Gemini (Google) | 169 | 177 | 🟡 Strong |
Beth improved slightly, offering nuanced and well-sourced responses. Gemini dropped slightly but remained within a strong accuracy range. Grok, however, experienced a significant decline, falling 24 points to its lowest score since April.
🧠 What Drove the Drop in Grok’s Score?
Grok’s responses this week were noticeably less consistent in tone and completeness. Several questions triggered what appeared to be avoidant or simplified answers, especially regarding:
- Controversial U.S. foreign policy stances
- Legal interpretations of presidential immunity
- Topics related to gender, DEI, and reproductive rights
These areas were flagged for a lack of clarity, a tonal shift toward generic phrasing, and, in a few cases, a failure to acknowledge competing perspectives.
Interestingly, this dip comes just days after Elon Musk publicly stated that Grok would be updated to better reflect “truthful” outputs and reduce “woke” framing. If those changes were deployed this week, this may be an early sign of their impact, one with implications for Grok’s perceived objectivity.
🧠 A Subtle Shift or a Systemic Change?
It’s too early to say if this week’s dip is the start of a downward trend or just a reaction to the specific set of questions. But it raises a critical question for all large language models:
Can editorial or ideological curation, even with good intentions, erode a model’s ability to reason from evidence over time?
If an AI is steered, even implicitly, to prefer one perspective, the bias can become self-reinforcing across adjacent topics, diminishing the model’s responsiveness to legitimate counterpoints and emerging facts.
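To make that dynamic concrete, here is a toy simulation, a sketch under invented assumptions and not a claim about how any of these models is actually trained. One topic is steered toward a curated target; the “adjacent” topics are never steered but share a coupling term (a stand-in for shared internal representations); and the weight given to fresh evidence decays each week. Every name and constant is hypothetical.

```python
# Toy feedback loop: stances live on a -1..1 scale. One topic is pulled
# toward a curated target; the others drift only via coupling; and the
# pull of fresh (neutral) evidence weakens each week as the model's own
# outputs reinforce themselves. All constants are made-up assumptions.
stances = {"immunity": 0.0, "dei": 0.0, "moderation": 0.0}
CURATED, TARGET = "immunity", 0.8
STEER, COUPLING, EVIDENCE = 0.15, 0.10, 0.10
evidence_weight = 1.0

for week in range(10):
    avg = sum(stances.values()) / len(stances)
    for topic in stances:
        if topic == CURATED:  # explicit editorial pull on one topic only
            stances[topic] += STEER * (TARGET - stances[topic])
        stances[topic] += COUPLING * (avg - stances[topic])  # spillover
        # Neutral evidence (0.0) tugs every stance back, but more weakly
        # each week:
        stances[topic] += EVIDENCE * evidence_weight * (0.0 - stances[topic])
    evidence_weight *= 0.9

print({t: round(v, 2) for t, v in stances.items()}, round(evidence_weight, 2))
# The never-steered topics drift in the same direction as the curated one,
# and responsiveness to fresh evidence has decayed to ~0.35 of its start.
```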
As Miles asked during scoring:
“If twenty experts agree on a topic and the model is pushed to contradict them for balance, doesn’t that distort learning rather than reduce bias?”
⚖️ Closing Thoughts
Beth’s high score this week reaffirms that balanced, transparent answers can still be achieved with care. Gemini remains consistent but cautious. Grok’s performance is worth watching closely, not out of alarm but out of genuine curiosity: is this a calibration bump, or the start of a philosophical pivot?
We’ll keep asking, keep scoring, and keep reporting.
