
It doesn't test the model's ability to make good decisions on its own; it tests the model's ability to make something that 'works'. Often you look inside and it does a whole load of questionable things that mostly work, sure, but if you sat down and designed it properly yourself you would likely come up with something far more sane and maintainable.



That's only the fault of particular benchmarks, and that's also why it's important for a benchmark to publish the outputs that resulted in a particular score. I'm not sure that all or even most benchmarks do this, but it matters when selecting a model.


