I don't know if it's ethically better to use LLMs trained on data licensed from X, Reddit, Stack Overflow, Sony, CNN and all the big content aggregators that have, or will have, agreements with big tech. I'd prefer to focus on mechanisms that force reciprocation of the donation: scrape and train at will, but publish the models as open weights, at least.
Anyway, vegan LLMs do exist; see the work of Pleias.ai.
I think their business model is to work on specialized models rather than general purpose. See https://pleias.ai/blog/sillon-ratp and the focus on synthetic datasets for specific personas.
Common Corpus is commonly used now in pretraining, including by closed labs, but rarely as the only source (which would be the actual ethical commitment).
Like most people training efficient models, we are moving toward synthetic pretraining while still maintaining our commitment to data research and releasability. Our current lead project is SYNTH, for now based on Wikipedia, but we'll generalize with seeds from Common Corpus.