Make illegally trained LLMs public domain as punishment

🃏Joker@sh.itjust.works · 17 hours ago

Make illegally trained LLMs public domain as punishment

FaceDeer@fedia.io · 14 hours ago

Legislation that prohibits publicly-viewable information from being analyzed without permission from the copyright holder would have some pretty dramatic and dire unintended consequences.

just_another_person@lemmy.world · 13 hours ago

Not really. It’s the same way you can’t sell music from live, public performances for profit and not get sued. There’s case law right there, and the fact that it’s a performance rather than something publicly published doesn’t matter. How the owner and originator classifies or licenses it is what defines it. It’s going to be years before anyone sees this get a ruling in court, though.

FaceDeer@fedia.io · 13 hours ago

That’s not what’s going on here, though. The LLM doesn’t contain the actual copyrighted data; it’s the result of analyzing the copyrighted data.

An analogous example would be a site like TV Tropes. TV Tropes doesn’t contain the works that it’s discussing; it just contains information about those works.

Superb@lemmy.blahaj.zone · 8 hours ago

No, the model does retain the original works in a lossy compression. This is evidenced by the fact that you can get a model to reproduce sections of its training data.

FaceDeer@fedia.io · 7 hours ago

You’re probably thinking of situations where overfitting occurred. Those situations are rare, and are considered to be errors in training. Much effort has been put into eliminating that from modern AI training, and it has been successfully done by all the major players.

This is an old no-longer-applicable objection, along the lines of “AI can’t do fingers right”. And even at the time, it was only very specific bits of training data that got inadvertently overfit, not all of it. You couldn’t retrieve arbitrary examples of training data.

just_another_person@lemmy.world · 13 hours ago

Did you not read my original comment before responding?

FaceDeer@fedia.io · 11 hours ago

You said:

“What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.”

But the point is that it doesn’t matter if the data is licensed or not. Lack of licensing doesn’t stop you from analyzing data once that data is visible to you. Do you think TV Tropes licensed any of the works of fiction that they have pages about?

“They pulled a very public and out in the open data heist and got away with it.”

They did not. No data was “heisted.” Data was analyzed. The product of that analysis does not contain the data itself, and so is not a violation of copyright.

just_another_person@lemmy.world · 10 hours ago

You’re thinking of licensing as a person putting something online WITH a license.

The terminology in this case is whether or not it was LICENSED by the commercial entity using and selling its derivative. That is the default. The burden is on the commercial entity to prove they were the original creator of said content. Otherwise it is plagiarism by default.

Here’s an example: I write a story and post it online. It’s about a toothbrush and a toilet scrubber falling in love and then having dish scrubber pads as children. I say the two main characters are called Dennis and Fran, and their children are called Denise and Francesca. Then somebody prompts OpenAI for a similar story and it kicks out the exact same story with the same names. I would win that case, because it would clearly be plagiarism beyond a doubt.

Unless you as OpenAI can prove these are all completely random (which they aren’t, because it’s trained on my data), I would be deemed the original creator of that story, and I would be entitled to any sales of that data.

Proving that is a different thing, but that’s what the laws say should happen. If they didn’t contact me to license that story, it’s still plagiarism. Same with music, movies, etc.

catloaf@lemm.ee · 11 hours ago

“The product of that analysis does not contain the data itself, and so is not a violation of copyright.”

That’s your opinion, not the opinion of a court or legislature. LLM products are directly derived from and dependent upon the training data, so they are positively considered derivative works. However, whether they’re considered sufficiently transformative, or whether they pass the fair use test, has not to my knowledge been determined in court. (Note that I am assuming US law here.)

FaceDeer@fedia.io · 11 hours ago

The courts have yet to come to a conclusion; the lawsuits are still ongoing. I think it’s unlikely they’ll conclude that the models contain the data, however, because it’s objectively not true.

The clearest demonstration I can think of to illustrate this is the old Stable Diffusion 1.5 model. It was trained on the LAION-5B dataset, which (as the “5B” indicates) contained 5 billion images. The resulting model was 1.83 gigabytes. So if it’s compressing images and storing them inside the model, it’d somehow need to fit ~2.7 images per byte. This is, simply, impossible.
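
For anyone who wants to check that arithmetic, here is a minimal sketch. It takes the figures in this comment at face value (5 billion images, a 1.83 gigabyte model) and assumes decimal gigabytes; the function names are just for illustration.

```python
# Figures as stated in the comment; decimal gigabytes (1 GB = 1e9 bytes) is an assumption.
NUM_IMAGES = 5_000_000_000
MODEL_BYTES = 1.83e9


def images_per_byte(num_images, model_bytes):
    """How many training images each byte of the model would have to hold."""
    return num_images / model_bytes


def bits_per_image(num_images, model_bytes):
    """How many bits of model size are available per training image."""
    return model_bytes * 8 / num_images


print(f"{images_per_byte(NUM_IMAGES, MODEL_BYTES):.2f} images per byte")
print(f"{bits_per_image(NUM_IMAGES, MODEL_BYTES):.2f} bits per image")
```

Under those assumptions this works out to roughly 2.7 images per byte, or a little under 3 bits of model size per training image.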

catloaf@lemm.ee · 9 hours ago

That’s not in question. It doesn’t need to contain the training data to be a derivative work, and therefore a potential infringement.

FaceDeer@fedia.io · 8 hours ago

You’ve got your definition of “derivative work” wrong. It does indeed need to contain copyrightable elements of another work for it to be a derivative work.

If I took a copy of Harry Potter, reduced it to a fine slurry, and then made a papier-mâché sculpture out of it, that’s not a derivative work. None of the copyrightable elements of the book survived.