Until now I was under the impression that this was the goal of these notices:
If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
Because if an LLM ingests a comment with a copyright notice like that, there’s a chance it will start appending copyright notices to its own responses, which could technically, legally, maybe make the AI model CC BY-NC-SA 4.0? A way to “poison” the dataset, so that OpenAI is obliged to distribute its model under that license. Obviously there’s no chance of that working, but it draws attention to AI companies breaking copyright law. (Also, I have no clue about copyright law.)
Your first mistake was thinking the companies training these models care. They’re actively lobbying for the right to say “fuck copyright when it benefits us!”.
Your second mistake is assuming that training an LLM means blindly putting everything in. There are human filters, then automated filters, then the LLM itself blurring things out. I can’t speak to that last one, but the first two will easily strip such obvious noise, the same way search engines very quickly became immune to random keyword spam two decades ago.
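Just to illustrate how cheap that kind of cleanup is, here’s a dumb regex pass (purely hypothetical, the pattern and function names are made up and have nothing to do with any lab’s actual pipeline) that would already catch this sort of appended boilerplate:

```python
import re

# Hypothetical pre-processing pass, only meant to show how little effort it
# takes to strip license-notice boilerplate from scraped comments.
LICENSE_NOTICE = re.compile(
    r"creative\s+commons"
    r"|CC\s+BY(-NC)?(-SA)?\s*\d\.\d"
    r"|all\s+rights\s+reserved"
    r"|licensed\s+under",
    re.IGNORECASE,
)

def strip_license_blurbs(text: str) -> str:
    """Drop any line that looks like an appended copyright/licence notice."""
    kept = [line for line in text.splitlines() if not LICENSE_NOTICE.search(line)]
    return "\n".join(kept)

comment = (
    "Here is my actual answer to your question.\n"
    "This comment is licensed under CC BY-NC-SA 4.0."
)
print(strip_license_blurbs(comment))  # only the first line survives
```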
Note that I didn’t even bother to check whether adding these little extra blurbs helps in any way, legally speaking. I doubt it would. Service ToS and other regulatory bodies probably carry more weight than that.
It would be pretty funny if GPT started putting licence notices under its answers because that’s what people do in its training data.