
I don't know if it's ethically better to use LLMs trained on data licensed from X, Reddit, Stack Overflow, Sony, CNN and all the big content aggregators that have, or will have, agreements with big tech. I'd prefer to focus on mechanisms that force reciprocation of the donation: scrape and train at will, but at least publish the models as open weights. Anyway, the vegan LLMs exist, see the work of Pleias.ai.



Pleias is definitely the most promising.

I haven't seen a model trained on that corpus since late 2024: https://simonwillison.net/2024/Dec/5/pleias-llms/ - I may have missed something though.


I think their business model is to work on specialized models rather than general purpose. See https://pleias.ai/blog/sillon-ratp and the focus on synthetic datasets for specific personas.

hi, so Pleias co-founder here.

Common Corpus is commonly used now in pretraining, including by closed labs, but rarely as the only source (which would be the actual ethical commitment).

Like most people training efficient models, we are moving toward synthetic pretraining, while still maintaining our commitment to data research and releasability. Our current lead project is SYNTH, for now based on Wikipedia, but we'll generalize with seeds from Common Corpus.



