Things might change but right now, you simply don’t need anyones authorization.
Hopefully it doesn’t change because only a handful of companies have the data or the funds to buy the data, it would kill any kind of open source or low priced endeavour.
FWIW, Common Crawl - a free/open-source dataset of crawled internet pages - was used by OpenAI for GPT-2 and GPT-3 as well as EleutherAI’s GPT-NeoX. Maybe on GPT3.5/ChatGPT as well but they’ve been hush about that.
I guess they will get to analyze OpenAI’s dataset during discovery. I bet OpenAI didn’t have authorization to use even 1% of the content they used.
Things might change but right now, you simply don’t need anyones authorization.
Hopefully it doesn’t change because only a handful of companies have the data or the funds to buy the data, it would kill any kind of open source or low priced endeavour.
FWIW, Common Crawl - a free/open-source dataset of crawled internet pages - was used by OpenAI for GPT-2 and GPT-3 as well as EleutherAI’s GPT-NeoX. Maybe on GPT3.5/ChatGPT as well but they’ve been hush about that.
That’s why they don’t feel they can operate in the EU, as the EU will mandate AI companies to publish what datasets they trained their solutions on.