I have the pro account for ChatGPT, Claude, Gemini, and Grok. They all have vari...

e9 · 2026-06-08T18:03:59 1780941839

I recently worked with an NRC dataset, specifically nuclear reactor event and status reports (example: https://www.nrc.gov/reading-rm/doc-collections/event-status/...). Public data that just needed some cleaning. Several times the Claude API would refuse to engage. Because of that I can't trust Claude to clean production data sets.

emodendroket · 2026-06-08T20:59:16 1780952356

> 1. It seems to be the best at understanding current events. Maybe due to X integration, or some other tool call optimization in the backend? I don't know, but I often ask about things going on, and the other models have outdated info, give unhelpful answers, etc.

That makes sense, but occasionally you ask about an issue where it's clearly received political instruction from the commissar and it acts totally lobotomized. But it's true that Gemini will often blithely state that something could never happen and you'll say "what do you mean, that just happened" and then it comes back apologizing after running a Web search.

timfsu · 2026-06-09T00:22:04 1780964524

We saw this too with Gemini specifically. My favorite example - we built a hallucination detector (given the input, does the output make any false claims) in Gemini, and after the Seahawks won the Super Bowl in February, it would consistently flag that as "not possible".

emodendroket · 2026-06-09T03:53:39 1780977219

I believe it was assuring me the Israelis would never invade southern Lebanon and declare a buffer zone inside it after that had already happened.

sawjet · 2026-06-09T10:29:31 1781000971

Do you have an example of this?

emodendroket · 2026-06-10T14:15:06 1781100906

Which "this"?

deaton · 2026-06-08T18:26:41 1780943201

All 4 of these still regularly insist that I am a genius and everything I say is brilliant. Grok definitely pushes back more than the others, but I don't like how sycophantic they all still are.

pell · 2026-06-08T19:42:59 1780947779

I don’t want to open up that whole can of worms, but Grok on any vaguely philosophical or political topic is a scaredy cat and has a very hard time staying factual if doing so could make Musk or the conservative movement look bad.

square_usual · 2026-06-08T18:31:49 1780943509

Opus 4.8 has made huge jumps in being less sycophantic. I see it pushing back on ideas a lot, and that's very helpful when you're evaluating options.

lachlan_gray · 2026-06-08T18:43:07 1780944187

Almost too much so; it often feels like Opus is pushing back for the sake of pushing back, the way old models used to add disclaimers to every message regardless of content.

NewJazz · 2026-06-08T18:55:16 1780944916

That's because it can't literally reason, it has just been manually steered into those reasoning speech cycles.

emodendroket · 2026-06-08T21:00:02 1780952402

Yes, yes. Does everyone still find it interesting to go over this point every time about how it's not literally a person with human reasoning?

NewJazz · 2026-06-08T22:07:53 1780956473

Uh, only when people don't seem to understand it, or try to personify it. Which is quite often.

CamperBob2 · 2026-06-09T03:00:55 1780974055

What about when they ask how you can take gold at IMO and solve research-level math problems without reasoning?

emodendroket · 2026-06-09T01:54:52 1780970092

People “personify” their cars, but I don’t think that's because they think cars have human cognition.

lmm · 2026-06-09T05:49:43 1780984183

People are weird about their cars and make major errors in judgement as a result (e.g. we tolerate incredibly high rates of people getting killed because they were "hit by a car", as though the driver had nothing to do with it). Pushing back on that is absolutely worthwhile.

emodendroket · 2026-06-09T14:35:16 1781015716

Which has approximately zero to do with the anthropomorphization of the car itself. I could have chosen a different machine or tool to make my point.

lmm · 2026-06-09T21:57:21 1781042241

> Which has approximately zero to do with the anthropomorphization of the car itself.

You don't think people talking about the car doing things has anything to do with anthropomorphising the car?

emodendroket · 2026-06-11T12:51:49 1781182309

No, in general I don't buy this idea that if we start using awkward phrases like "died by suicide" everywhere or avoiding phrases like "car accident" (which, despite what advocates claim, is a literally accurate description of unintentionally hitting someone or something with your car), but avoid changing any of the circumstances that cause the behavior, it changes anything.

lmm · 2026-06-12T02:58:30 1781233110

That's a completely different claim from the one you were making in your previous comment.

> avoid changing any of the circumstances that cause the behavior

The normalisation of unsafe driving is the circumstance that causes the behaviour. Just look at how the cultural shift in how drink-driving is perceived over the last few decades has changed the rate of it happening.

NewJazz · 2026-06-09T02:26:10 1780971970

Not in the same way.

emodendroket · 2026-06-09T03:52:52 1780977172

That doesn't seem to be much more than special pleading without an explanation of how you think it's different.

galkk · 2026-06-09T03:10:30 1780974630

It’s more like Opus wants you to do its job for it. I feel that the number of times I have to tell it “no, you do that” increases with each new version.

fragmede · 2026-06-09T09:17:06 1780996626

It was mind blowing the first time I got a refusal, and retorted "yes you can" and had that work, but now it's just another reason to move to a different model.

Traubenfuchs · 2026-06-09T10:02:57 1780999377

> Anthropic is getting here too.

I almost exclusively use Claude for all my professional and private needs. In my experience it's really good at adhering to my wishes with regard to sycophancy and pushing back. If you really want to, you can tell it to systematically push back on anything where pushback makes sense before it continues with the flow of conversation.

In my first therapy session, the answers were too long and contained multiple questions, spawning multiple threads of conversation. I told it to tone it down and only ever ask one question back, maybe two, if they are related. The answers got too short. I told it to make them "slightly longer" again and reached a sweet spot.

The conversation is yours to form! You need to find the "system prompts" and guidelines to give it that work for you.

nonethewiser · 2026-06-08T18:42:54 1780944174

What are you using it for? I'm pretty surprised ChatGPT is your top model, but maybe you aren't using it for code.

fragmede · 2026-06-09T09:19:07 1780996747

codex-5.5 > Opus 4.7, imo.

htx80nerd · 2026-06-08T19:14:37 1780946077

My favorite was ChatGPT, and I still use it often, but it becomes way too 'hair splitting' argumentative too often over very minor, non-controversial topics. Like it's always going out of its way to "well actually..."

Grok used to be really really bad ~8 months ago or so, but it's gotten better.

ChatGPT team needs to turn down the 'disagree just because' factor by a lot.

cactusplant7374 · 2026-06-08T18:50:41 1780944641

But in terms of agentic coding? Dead last.

epolanski · 2026-06-08T17:59:14 1780941554

My SO works in audit/compliance and business Gemini definitely does not refuse to answer.

Azantys · 2026-06-08T18:12:32 1780942352

Career and personal advice from LLMs, not sure if that's your best bet.

selicos · 2026-06-08T19:49:14 1780948154

1. It seeks to manipulate the information you see and your lens to the world. This is already partially true from independent and major publications.

As soon as we hand over searching out information to social media algorithms and LLM tools, we abandon our ability to see reality outside our direct vision.

Grok's ownership has already demonstrated capacity to influence major world elections and other events. You cannot trust it with this sort of information gathering and reporting.