It doesn't test the model's ability to make good decisions on its own, it tests t... | Hacker News


alienbaby 1 day ago | on: GLM 5.2 vs. Opus

It doesn't test the model's ability to make good decisions on its own; it tests the model's ability to make something that 'works'. Often you look inside and find it does a whole load of questionable things that mostly work, sure, but if you sat down and designed it properly yourself, you would likely come up with something far more sane and maintainable.

LoganDark 1 day ago

That's only the fault of particular benchmarks, and it's also why it's important to publish the outputs that produced a given score. I'm not sure that all or even most benchmarks do this, but it matters when selecting a model.
