I don't know if it's ethically better to use LLMs trained on data licensed from ...

simonw · 2026-06-23T01:37:49 1782178669

Pleias is definitely the most promising.

I haven't seen a model trained on that corpus since late 2024: https://simonwillison.net/2024/Dec/5/pleias-llms/ - I may have missed something though.

reedciccio · 2026-06-23T06:20:52 1782195652

I think their business model is to work on specialized models rather than general-purpose ones. See https://pleias.ai/blog/sillon-ratp and the focus on synthetic datasets for specific personas.

Dorialexander · 2026-06-23T10:42:15 1782211335

hi, so Pleias co-founder here.

Common Corpus is now commonly used in pretraining, including by closed labs, but rarely as the only source (which would be the actual ethical commitment).

Like most people training efficient models, we are moving toward synthetic pretraining while still maintaining our commitment to data research and releasability. Our current lead project is SYNTH, for now based on Wikipedia, but we'll generalize with seeds from Common Corpus.