I like her and I get why creatives are panicking because of all the AI hype.
However:
In evidence for the suit against OpenAI, the plaintiffs claim ChatGPT violates copyright law by producing a “derivative” version of copyrighted work when prompted to summarize the source.
A summary is not a copyright infringement. If anything has a case for fair use, it’s a summary.
The comic’s suit questions if AI models can function without training themselves on protected works.
A language model does not need to be trained on the text it is supposed to summarize. She clearly does not know what she is talking about.
IANAL though.
I guess they will get to analyze OpenAI’s dataset during discovery. I bet OpenAI didn’t have authorization to use even 1% of the content they used.
Things might change but right now, you simply don’t need anyone’s authorization.
Hopefully it doesn’t change. Only a handful of companies have the data or the funds to buy the data, so a change would kill any kind of open source or low priced endeavour.
FWIW, Common Crawl - a free/open-source dataset of crawled internet pages - was used by OpenAI for GPT-2 and GPT-3 as well as EleutherAI’s GPT-NeoX. Maybe on GPT3.5/ChatGPT as well but they’ve been hush about that.
That’s why they don’t feel they can operate in the EU, as the EU will mandate AI companies to publish what datasets they trained their solutions on.
SS is such a tool. Does anybody remember the big anti-gay speech that launched her career in The Way of the Gun? She’ll do anything to get ahead.
Here’s the speech: https://www.youtube.com/watch?v=PAl5xGi7urQ
You hate her because of a part in a shitty movie?
Did I say hate? I said she’s a tool.
Here is an alternative Piped link(s): https://piped.video/watch?v=PAl5xGi7urQ
Piped is a privacy-respecting open-source alternative frontend to YouTube.
I’m open-source, check me out at GitHub.
Good piped bot
Like the record labels sued every music sharing platform in the early days. Adapt. They’re all afraid of new things but in the end nobody can stop it. Think, learn, work with it, not against it.
I think it’s valid. This isn’t about the tech, but the sources of your work.
Of course it’s valid. And the misuse of AI has to be fought. Nevertheless we have to think differently in the face of something we cannot stop in the long run. You cannot create a powerful tool and only misuse it. I miscommunicated here, should’ve explained myself, I got no excuses, maybe one: I sat on the shitter and wanted to make things short.
Feels like a publicity play
I feel like when confronted about a “stolen comedy bit” a lot of these people complaining would also argue that “no work is entirely unique, everyone borrows from what already existed before.” But now they’re all coming out of the woodwork for a payday or something… It’s kinda frustrating especially if they kill any private use too…
The issue isn’t that people are using others’ works for ‘derivative’ content.
The issue is that, for a person to ‘derive’ comedy from Sarah Silverman the ‘analogue’ way, you have to get her works legally, be that streaming her comedy specials, or watching movies/shows she’s written for.
With ChatGPT and other AI, it’s been ‘trained’ on her work (and, presumably, as many others’ works as possible) once, and now there are no ‘views’, or even sources given, to those properties.
And like a lot of digital work, its reach and speed is unprecedented. Like, previously, yeah, of course you could still ‘derive’ from people’s works indirectly, like from a friend that watched it and recounted the ‘good bits’, or through general ‘cultural osmosis’. But that was still limited by the speed of humans, and of culture. With AI, it can happen a functionally infinite number of times, nearly instantly.
Is all that to say Silverman is 100% right here? Probably not. But I do think that the legality of ChatGPT, and other AI that can ‘copy’ artists’ work, is worth questioning. But it’s a sticky enough issue that I’m genuinely not sure what the best route is. Certainly, I think current AI writing and image generation ought to be ineligible for commercial use until the issue has at least been addressed.
The issue is that, for a person to ‘derive’ comedy from Sarah Silverman the ‘analogue’ way, you have to get her works legally, be that streaming her comedy specials, or watching movies/shows she’s written for.
Damn did they already start implanting DRM bio-chips in people?
And like a lot of digital work, its reach and speed is unprecedented. Like, previously, yeah, of course you could still ‘derive’ from people’s works indirectly, like from a friend that watched it and recounted the ‘good bits’, or through general ‘cultural osmosis’.
Please explain why you cannot download a movie/episode/ebook illegally and then directly derive from it.
I mean, you can do that, but that’s a crime.
Which is exactly what Sarah Silverman is claiming ChatGPT is doing.
And, beyond an individual crime of a person reading a pirated book, again, we’re talking about ChatGPT and other AI magnifying reach and speed, beyond what an individual person ever could do even if they did nothing but read pirated material all day, not unlike websites like The Pirate Bay. Y’know, how those websites constantly get taken down and have to move around the globe to areas where they’re beyond the reach of the law, due to the crimes they’re doing.
I’m not like, anti-piracy or anything. But also, I don’t think companies should be using pirated software, and my big concern about LLMs isn’t really for private use, but for corporate use.
I mean, you can do that, but that’s a crime.
Consuming content illegally is by definition a crime, yes. It also has no effect on your output. A summary or review of that content will not be infringing, it will still be fair use.
A more substantial work inspired by that content could be infringing or not depending on how close it is to the original content but not on the legality of your viewing of that content.
Nor is it relevant. If you have any success with your copy you are going to cause way more damage to the original creator than pirating one copy.
And, beyond an individual crime of a person reading a pirated book, again, we’re talking about ChatGPT and other AI magnifying reach and speed, beyond what an individual person ever could do even if they did nothing but read pirated material all day, not unlike websites like The Pirate Bay. Y’know, how those websites constantly get taken down and have to move around the globe to areas where they’re beyond the reach of the law, due to the crimes they’re doing.
I can assure you that The Pirate Bay is quite stable. I would like to point out that none of the AI vendors has actually been found liable for copyright infringement yet. That their use is infringing and a crime is your opinion.
It is also going to be irrelevant because there are companies that do own massive amounts of copyrighted materials and will be able to train their own AIs, both to sell as a service and to cut down on labor costs of creating new materials. There are also companies that got people to agree to licensing their content for AI training, such as Adobe.
So copyright law will not be able to help creators, and there will be a push for more laws and regulators. Depending on what they manage to push through, you can forget about AI that isn’t backed by a major corporation, and expect reduced fair use rights (as in unapproved reviews being de facto illegal) and perhaps a new push against software that could be used for piracy, such as non-regulated video or music players, never mind encoders etc.
Consuming content illegally is by definition a crime, yes.
What law makes it illegal to consume an unauthorized copy of a work?
That’s not a flippant question. I am being absolutely serious. Copyright law prohibits the creation and distribution of unauthorized copies; it does not prohibit the reception, possession, or consumption of those copies. You can only declare content consumption to be “illegal” if there is actually a law against it.
What law makes it illegal to consume an unauthorized copy of a work?
That’s not a flippant question. I am being absolutely serious. Copyright law prohibits the creation and distribution of unauthorized copies; it does not prohibit the reception, possession, or consumption of those copies. You can only declare content consumption to be “illegal” if there is actually a law against it.
Which legal system?
She’s an American actor, suing an American company, so I think we should discuss the laws of Botswana, Mozambique, and Narnia. /s
Consuming content illegally is by definition a crime, yes. It also has no effect on your output. A summary or review of that content will not be infringing, it will still be fair use.
That their use is infringing and a crime is your opinion.
“My opinion”? Have you read the headline? It’s not my opinion that matters, it’s that of the prosecution in this lawsuit. And this lawsuit indeed alleges that copyright infringement has occurred; it’ll be up to the courts to see if the claim holds water.
I’m definitely not sure that GPT4 or other AI models are copyright infringing or otherwise illegal. But, I think that there’s enough that seems questionable that a lawsuit is valid to do some fact-finding, and honestly, I feel like the law is a few years behind on AI anyway.
But it seems plausible that the AI could be found to be ‘illegally distributing works’, or otherwise have broken IP laws at some point during their training or operation. A lot depends on what kind of agreements were signed over the contents of the training packages, something I frankly know nothing about, and would like to see come to light.
“My opinion”? Have you read the headline? It’s not my opinion that matters, it’s that of the prosecution in this lawsuit. And this lawsuit indeed alleges that copyright infringement has occurred; it’ll be up to the courts to see if the claim holds water.
No, the opinion that matters is the opinion of the judge. Before we have a decision, there is no copyright infringement.
I’m definitely not sure that GPT4 or other AI models are copyright infringing or otherwise illegal. But, I think that there’s enough that seems questionable that a lawsuit is valid to do some fact-finding
You sure speak as if you do.
and honestly, I feel like the law is a few years behind on AI anyway.
But it seems plausible that the AI could be found to be ‘illegally distributing works’, or otherwise have broken IP laws at some point during their training or operation. A lot depends on what kind of agreements were signed over the contents of the training packages, something I frankly know nothing about, and would like to see come to light.
I said in my previous post that copyright will not solve the problem you describe as the law being behind on AI. Considering how the laws regarding copyright ‘caught up with the times’ in the beginning of the internet… I am not optimistic the changes will be beneficial to society.
In evidence for the suit against OpenAI, the plaintiffs claim ChatGPT violates copyright law by producing a “derivative” version of copyrighted work when prompted to summarize the source.
Both filings make a broader case against AI, claiming that by definition, the models are a risk to the Copyright Act because they are trained on huge datasets that contain potentially copyrighted information
They’ve got a point.
If you ask AI to summarize something, it needs to know what it’s summarizing. Reading other summaries might be legal, but then why not just read those summaries first?
If the AI “reads” the work first, then it would have needed to pay for it. And how do you deal with that? Is a chatbot treated like one user? Or does it need to pay for a copy for each human that asks for a summary?
I think if they’d paid for a single ebook library subscription they’d be fine. However, the article says they used pirate libraries so it could read anything on the fly.
Pointing an AI at pirated media is going to be hard to defend in court. And a class action full of authors and celebrities isn’t going to be a cakewalk. They’ve got a lot of money to fight, and have lots of contacts for copyright laws. I’m sure all the publishers are pissed too.
Everyone is going after AI money these days, this seems like the rare case where it’s justified
If the AI “reads” the work first, then it would have needed to pay for it
That’s not actually true. Copyright applies to distribution, not consumption. You violate no law when I create an unauthorized copy of a work, and you read that copy. Copyright law prohibits you from distributing further copies, but it does not prohibit you from possessing the copy I provided you, nor are you prohibited from speaking about the copy you have acquired.
Unless the AI is regurgitating substantial parts of the original work, its output is a “transformative derivation”, which is not subject to the protections of the original copyright. The AI is doing what English teachers ask of every school-age child: create a book report.
The US copyright office says this on their website
Uploading or downloading works protected by copyright without the authority of the copyright owner is an infringement of the copyright owner’s exclusive rights of reproduction and/or distribution.
If the company downloaded books without buying them to train their AI, that’s copyright infringement
The US copyright office says this on their website
Their website has zero weight as legal precedent. It is an oversimplification that does not stand up to scrutiny.
The combined act of transmitting the work from uploader to downloader is infringing, but only the uploader’s actions conflict with copyright law. The downloader’s actions do not.
Copyright applies to distribution, not consumption. You violate no law when I create an unauthorized copy of a work
This is completely untrue. Making any unauthorised copy is an infringement of copyright. Hell, the UK determined that merely loading a pirated game into RAM was unauthorised copying, making the act of playing a pirated game unlawful - thankfully this is only the case in the UK, however the basic principles of copyright are the same all over the world.
When you buy something, you get a limited license to make copies for the purpose of viewing the material. That license does not extend to making backup copies. However, in a practical sense, it is very unlikely you will be prosecuted for most kinds of infringement like this - particularly when no money is involved. It’s still infringement, though.
Edit: I will say though: you violate no law when you view a copy I create. However I would still be infringing for making and showing you the copy.
In the case of making a book report, that is educational, and thus fair use. ChatGPT is not educational - you might use it for education, but ChatGPT’s use of copyrighted material is for commercial enterprise.
The uploader is the person creating the copy. Downloading is not creating a copy; downloading is receiving a copy.
I would love to see a citation on that UK precedent, but as you said: “thankfully this is only the case in the UK” and does not apply in the rest of the world.
Making any unauthorised copy is an infringement of copyright.
The exceptions to that are so numerous that the statement is closer to false than truth. “Fair Use” blows the absolute nature of that statement out of the water.
There has never been a successful prosecution for downloading only.
Every single transfer of data is a copy. There is no such thing as moving data. Only copying it and then voluntarily deleting the original, to fake it having “moved”
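To make the “copy, then delete” point concrete, here is a minimal Python sketch. The function name `move_by_copy` and the file names are made up for illustration, and it only models a byte-for-byte transfer between two paths. A rename inside a single filesystem can work differently, and none of this says anything about how copyright law treats the operation.

```python
import os
import tempfile


def move_by_copy(src: str, dst: str, chunk_size: int = 4096) -> int:
    """'Move' a file by copying its bytes to dst, then deleting src.

    Returns the number of bytes that were copied.
    """
    copied = 0
    with open(src, "rb") as reader, open(dst, "wb") as writer:
        while True:
            chunk = reader.read(chunk_size)
            if not chunk:
                break
            # Each chunk is read into memory and written out again:
            # the destination holds a second copy of the data.
            writer.write(chunk)
            copied += len(chunk)
    # Only now is the original removed, which makes it look "moved".
    os.remove(src)
    return copied


def demo() -> tuple:
    """Move a small hypothetical file and report what happened."""
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "original.txt")
        dst = os.path.join(workdir, "moved.txt")
        with open(src, "wb") as f:
            f.write(b"some copyrighted text")
        copied = move_by_copy(src, dst)
        with open(dst, "rb") as f:
            data = f.read()
        # (bytes copied, does the original still exist?, destination content)
        return copied, os.path.exists(src), data


if __name__ == "__main__":
    print(demo())
```

Between the last `write` and the `os.remove` call, both files exist at once, which is the sense in which a “move” is really a copy followed by a delete.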
Every single transmission of data is a copy. Receiving data is not. The person creating the copy is the sender, not the receiver.
eh it gets fuzzy. the sender transmits, but the receiver also writes a copy. it gets copied to the wire, and it gets copied from the wire. there is an ephemeral intermediate copy “on the wire”. I guess there’s no right answer; it’s like a fractal, the answer keeps changing when you look deeper
eh it gets fuzzy. the sender transmits, but the receiver also writes a copy
Got a Ring doorbell? A security camera? If I walk up to your camera and start playing a copyrighted work, have you infringed on copyright? Of course not. The recording you saved now contains a copy of the work, but you were privileged in recording at the time.
That doesn’t change when you ask me to come “send” that work to your camera. You are free to ask for something that I am not obligated to provide. If I choose to provide it, I am the infringing party, not you.
Downloading is no different. You ask me to use a specific protocol to send a specific work to a specific port at a specific address. I can choose to do that, or I can tell you to pound sand. If I choose to send it, I am the infringing party, not you.
The specific processes applied by the computer to save and replay the work would not qualify as “copying” under copyright law. If they did, viewing any copyrighted work would be an infringement, as the computer uses those same processes to view legitimate copies as it does illegitimate ones.
I feel you guys are arguing very precise legal matters without defining the jurisdiction. I mean sure, go ahead, but it’s meaningless. One could say “I live in this random country and we don’t even have a concept of copyright, therefore it does not exist!”
Sarah Silverman is an American actress. OpenAI is an American company. Relevant jurisdiction was defined in the headline.
We’re not talking about fair use though - which also is incredibly limited. It only applies to education, news or criticism. Fair use would be an authorised copy, by definition.
and does not apply in the rest of the world.
The specific ruling does not apply to the rest of the world, so there is no established precedent elsewhere that playing a pirated video game is an offense. This just means someone wishing to prosecute this offense would have no case law to back up their claim. However the principle that led to the ruling is the same - you need a license to make a copy (except for fair use, which as I say would rarely apply) and computers copy files internally in order to display their content.
The uploader is the person creating the copy. Downloading is not creating a copy; downloading is receiving a copy.
One person is providing a copy to someone else - that person is infringing copyright - and the person receiving is writing a copy to their device, and furthermore needs to make copies to display the content - that person is also infringing copyright.
You can’t open a file like you would a book. You need to copy and process the file in order to display it.
There has never been a successful prosecution for downloading only.
There have been no prosecutions for downloading only because the level of damages is so low that it isn’t worth the cost of going to court. That doesn’t make it less illegal, it’s just more likely you’ll get away with it.
You can’t open a file like you would a book. You need to copy and process the file in order to display it.
That precedent has never been set in the US. The “process” you’re talking about for a human to open a digital book is not considered “copying” under US law.
There have been no prosecutions for downloading only because the level of damages is so low
That is a theory. Not a very compelling one, given the level of pettiness we regularly see in the courts. The precedent of a successful prosecution for downloading would be extremely valuable to rights holders: it would have a chilling effect on the entire community of pirates. The reverse is also true: a failed prosecution would lend a great deal of legitimacy to piracy for personal consumption.
The actual reason why rights holders aren’t pressing cases against downloaders is because they know they will fail. Copyright law is not written or interpreted in such a way as to enable prosecution of people for receiving a work, or even for requesting a work be sent to them. Copyright law envisions pirate distributors, not consumers.
There was still copyright infringement because the company probably downloaded the text (which created another copy) and modified it (alteration is also protected by copyright) before using it as training data. If you write an original novel and admit that you had pirated a bunch of novels to use for reference, those novels were still downloaded illegally even if you’ve deleted them by now. The AI isn’t copyright infringement itself, it’s proof that copyright infringement has happened.
But personally I don’t think the actual laws will matter so much as which side has the better case for why they will lead to more innovation and growth for the economy.
There was still copyright infringement because the company probably downloaded the text (which created another copy)
Sure, someone likely infringed on copyright for that copy to be created, but the person/entity committing that infringement is the sender, not the receiver. The uploader is the infringing party, not the downloader.
If you write an original novel and admit that you had pirated a bunch of novels to use for reference, those novels were still downloaded illegally even if you’ve deleted them by now.
They were uploaded illegally. The people who distributed those copies to me have infringed on copyright, sure. My receiving those copies does not constitute infringement. Uploading is the illegal act, not downloading.
My work does not violate copyright, unless I use a substantial part of the other works. But, if I used substantial parts of those works, my work would be some sort of “derivation” and not the “original novel” you declared it. (Many types of derivation fall within “fair use” and do not constitute infringement.)
Whether I delete the works or not is entirely irrelevant. I am prohibited from creating and distributing additional copies, but I am not prohibited from receiving, possessing, or consuming an unauthorized copy.
The uploader is the infringing party, not the downloader.
an exclusive right of the copyright holder is the right to duplicate their work. downloading IS illegal because you’re creating an unauthorized duplicate of the work on your machine. your duplicate is distinct from the duplicate that someone else had created and uploaded. it’s just very hard to get caught downloading, and it’s not very cost effective for companies to pursue since they would only stop one person. that’s why most companies like the RIAA targeted torrents for their lawsuits, because they could easily see the ip addresses (which is why you should always use a vpn when torrenting) and because they could shut down uploaders. but downloading itself is still very illegal
My work does not violate copyright, unless I use a substantial part of the other works.
like I said, the AI is not a violation (probably, unless the courts later disagree), it’s proof that unauthorized duplication of copyrighted works has occurred, and that is illegal
You cannot create a copy of a work that you do not possess. The downloader does not possess the work to create a copy. Only the uploader is even capable of creating the copy. The downloader cannot create a copy; he can only request.
If he does something else with that copy he receives, he becomes something other than merely a downloader. That “something else” could be unlawful, but that “something else” is not “downloading”.
It could be unlawful if the downloader gains unauthorized access to the computer system, but that would not be a copyright violation. It could be unlawful if the downloader conspires with the uploader, but the degree of collaboration would have to be much greater to support a conspiracy charge.
Downloading does not meet the statutory criteria for copyright infringement. Downloading alone is not infringement.
They get people torrenting movies by saying you seed while you leech…
So if they torrented them en masse, they broke it.
Exactly: seeding is uploading, and uploading can be infringement. So, if your torrent client seeded any part of the work to anyone, that could be considered infringement.
But, there is no evidence that ChatGPT received the works in question via torrent, and even if there was, there is no evidence that they actually seeded anything back to the swarm. Hell, there’s no evidence that ChatGPT even actually possesses the works in question.
Can the sources where ChatGPT got its information from be traced? What if it got the information from other summaries?
I think the hardest thing for these companies will be validating the information their AI is using. I can see an encyclopedia-like industry popping up over the next couple years.
Btw I know very little about this topic but I find it fascinating
Yes! They publish the data sources and where they got everything from. Diffusers (Stable Diffusion/Midjourney etc) and GPT both use tons of data that was taken in ways that likely violate that data’s usage agreement.
Imo they deserve whatever lawsuits they have coming.
likely violate that data’s usage agreement.
It doesn’t seem to be too common for books to include specific clauses or EULAs that prohibit their use as data in machine learning systems. I’m curious if there are really any aspects that cover this without it being explicitly mentioned. I guess we’ll find out.
I think with a book your standard digital license / copyright would forbid it, would it not?
Maybe. I’m interested in the specifics.
It depends on if the summary is an infringing derivative work, doesn’t it? Wikipedia is full of summaries, for example, and it’s not violating copyright.
If they illegally downloaded the works, that feels like a standalone issue to me, not having anything to do with AI.
Wikipedia is a non profit whose primary purpose is education. ChatGPT is a business venture.
A book review published in a newspaper is a commercial venture for the purpose of selling ads. The commercial aspect doesn’t make the review an infringement.
A summary is a “Transformative Derivation”. It is a related work, created for a fundamentally different purpose. It is a discussion about the work, not a copy of the work. Transformative derivations are not infringements, even where they are specifically intended to be used for commercial purposes.
A book review is most likely critical, and thus falls under fair use.
A summary is not critical, so would not have a fair use exemption. I would also disagree that it is transformative. That argument is about work that is so different to the original that it must be considered a separate piece (eg new music that uses a sample from old music). A summary is inherently not transformative, because it is merely a shortened version of the original - the ideas expressed are the same.
Transformative doesn’t mean that the idea is different. It means the purpose for expressing the idea is different. Informing an individual or the general public of the general idea presented in a book is not an infringement. If it were, every book report every student is ever asked to write would be an infringement.
https://en.m.wikipedia.org/wiki/Transformative_use
Transformativeness is a characteristic of such derivative works that makes them transcend, or place in a new light, the underlying works on which they are based.
A summary would not place the original work in a new light. A summary is the same work but shorter. A summary would be infringement.
Student book reports are for educational purposes, which has its own specific exemption under fair use. As does work which is critical of the original, along with news. A critical piece, for example, is transformative because it introduces new ideas, talking about the work and framing it in new ways.
AI meets none of these exemptions with a summary. It’s debatable whether it even could meet these exemptions in the way that it functions.
Student book reports are for educational purposes, which has its own specific exemption under fair use. As does work which is critical of the original, along with news. A critical piece, for example, is transformative because it introduces new ideas, talking about the work and framing it in new ways.
You’re forgetting two other important categories of fair use. Paste that student’s book report in a newspaper, and it is no longer “educational”, but it is still “news reporting”. “Author publishes work” is a newsworthy event.
Paste it in response to an individual asking about the work, and again, it is no longer educational, but it is still “commentary”, which is much the same as news reporting but with a typically smaller audience.
Even if these two categories of fair use were not specifically included in copyright law, they would naturally arise from the right to free speech. Making a summary subject to the original copyright would make it unlawful for anyone to even discuss the work at all.
She’s going to lose the lawsuit. It’s an open and shut case.
“Authors Guild, Inc. v. Google, Inc.” is the precedent case, in which the US Court of Appeals for the Second Circuit held (and the Supreme Court declined to review) that transformative digitization of copyrighted material inside a search engine constitutes fair use. Text used for training LLMs is even more transformative than book digitization, since it is near impossible to reconstitute the original work barring extreme overtraining.
You will have to understand why styles can’t and should not be able to be copyrighted, because that would honestly be a horrifying prospect for art.
“Transformative” in this context does not mean simply not identical to the source material. It has to serve a different purpose and to provide additional value that cannot be derived from the original.
The summary that they talk about in the article is a bad example for a lawsuit because it is indeed transformative. A summary provides a different sort of value than the original work. However if the same LLM writes a book based on the books used as training data, then it is definitely not an open and shut case whether this is transformative.
But what an LLM does meets your listed definition of transformative as well. It indeed provides additional value that can’t be derived from the original, because everything it outputs is completely original but similar in style to the original, and you can’t use it to reconstitute the original work. In other words, it is similar to fan work, which is also why the current ML models, text2text or text2image, are called “transformers”. Again, works similar in style to the original cannot and should not be considered copyright infringement, because that’s a can of worms nobody actually wants to open, and the courts have been very consistent on that.
So, I would find it hard to believe that if there is an appellate ruling, left standing by the Supreme Court, which finds digitizing copyrighted material in a database is fair use and not derivative work, that the courts wouldn’t consider digitizing copyrighted material in a database with very lossy compression (that’s a more accurate description of what LLMs are, please give this a read if you have time) fair use as well. Of course, with the current Roberts court, there is always the chance that weird things can happen, but I would be VERY surprised.
There is also the previous ruling that raw transformer output cannot be copyrighted, but that’s beyond the scope of this post for now.
My problem with LLM outputs is mostly that they are just bad writing, and I’ve been pretty critical of """Open"""AI elsewhere on Lemmy, but I don’t see Silverman’s case going anywhere.
But what an LLM does meets your listed definition of transformative as well
No it doesn’t. Sometimes the output is used in completely different ways but sometimes it is a direct substitute. The most obvious example is when it is writing code that the user intends to incorporate into their work. The output is not transformative by this definition as it serves the same purpose as the original works and adds no new value, except stripping away the copyright of course.
everything it outputs is completely original
[citation needed]
that you can’t use to reconstitute the original work
Who cares? That has never been the basis for copyright infringement. For example, as far as I know I can’t make and sell a doll that looks like Mickey Mouse from Steamboat Willie. It should be considered transformative work. A doll has nothing to do with the cartoon. It provides a completely different sort of value. It is not even close to being a direct copy or able to reconstitute the original. And yet, as far as I know I am not allowed to do it, and even if I am, I won’t risk going to court against Disney to find out. The fear alone has made sure that we mere mortals cannot copy and transform even the smallest parts of copyrighted works owned by big companies.
I would find it hard to believe that if there is a Supreme Court ruling which finds digitalizing copyrighted material in a database is fair use and not derivative work
Which case are you citing? Context matters. LLMs aren’t just a database. They are also a frontend to extract the data from these databases, that is being heavily marketed and sold to people who might otherwise have bought the original works instead.
The lossy compression is also irrelevant, otherwise literally every pirated movie/series release would be legal. How lossy is it even? How would you measure it? I’ve seen GitHub Copilot spit out verbatim copies of code. I’m pretty sure that if I ask ChatGPT to recite a very well-known poem, it will also be a verbatim copy. So there are at least some works that are included completely losslessly. Which ones? No one knows and that’s a big problem.
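For what it’s worth, one rough way to put a number on “how verbatim is this output” is to compare it against a candidate source text. The sketch below is only an illustration under loud assumptions: the function names `ngram_overlap` and `longest_shared_run` are made up for this example, the whitespace tokenization is crude, the window size of 5 words is arbitrary, and it only works if you already have the suspected source text in hand, which for a closed training set you generally do not.

```python
from difflib import SequenceMatcher


def tokens(text):
    # Crude assumption: lowercase and split on whitespace.
    return text.lower().split()


def ngram_overlap(output, source, n=5):
    """Fraction of the output's n-word windows that appear verbatim in the source."""
    out, src = tokens(output), tokens(source)
    out_grams = [tuple(out[i:i + n]) for i in range(len(out) - n + 1)]
    if not out_grams:
        return 0.0
    src_grams = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
    return sum(g in src_grams for g in out_grams) / len(out_grams)


def longest_shared_run(output, source):
    """Length, in words, of the longest word sequence shared by both texts."""
    out, src = tokens(output), tokens(source)
    matcher = SequenceMatcher(None, out, src, autojunk=False)
    match = matcher.find_longest_match(0, len(out), 0, len(src))
    return match.size
```

A high overlap fraction or a long shared run suggests memorization of that particular text; a low one says nothing about works you did not think to check, which is the “no one knows” problem again.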
Let’s remove the context of AI altogether.
Say, for instance, you were to check out and read a book from a free public library. You then go on to use some of the book’s content as the basis of your opinions. More, you also absorb some of the common language structures used in that book and unwittingly use them on your own when you speak or write.
Are you infringing on copyright by adopting the book’s views and using some of the sentence structures its author employed? At what point can we say that an author owns the language in their work? Who owns language, in general?
Assuming that a GPT model cannot regurgitate verbatim the contents of its training dataset, how is copyright applicable to it?
Edit: I also would imagine that if we were discussing an open source LLM instead of GPT-4 or GPT-3.5, sentiment here would be different. And more, I imagine that some of the ire here stems from a misunderstanding of how transformer models are trained and how they function.
Let’s remove the context of AI altogether.
Yeah sure if you do that then you can say anything. But the context is crucial. Imagine that you could prove in court that I went down to the public library with a list that read “Books I want to read for the express purpose of mimicking, and that I get nothing else out of”, and on that list was your book. Imagine you had me on tape saying that for me writing is not a creative expression of myself, but rather I am always trying to find the word that the authors I have studied would use. Now that’s getting closer to the context of AI. I don’t know why you think you would need me to sell verbatim copies of your book to have a good case against me. Just a few passages should suffice given my shady and well-documented intentions.
Well that’s basically what LLMs look like to me.
I’m tired of internet arguments. If you are not going to make a good faith attempt to understand anything I said, then I see no point in continuing this discussion further. Good day.
I know this is kind of a silly argument but storing protected work in our own human memories to recall later is certainly not reproduction.
I don’t think it’s reproduction for ChatGPT to file away that information to call on it later. It’s just better at it than we are.
Personally I find this stupid. If we have robots walking around, are they going to be sued every time they see something that’s copyrighted?
Is this what will stop progress that could save us from environmental collapse? That a robot could summarize your shitty comedy?
Copyright is already a disgusting mess, and still nobody cares about models being created specifically to manipulate people en masse. “What if it learned from MY creations” asks every self-obsessed egoist in the world.
Doesn’t matter how many people this tech could save after another decade of development. Somebody think of the [lucky few artists that had the connections and luck to make a lot of money despite living in our soul crushing machine of a world]
All of the children growing up abused and in pain with no escape don’t matter at all. People who are sick or starving or homeless do not matter. Making progress to save the world from imminent environmental disaster doesn’t matter. Let Canada burn more and more every year. As long as copyright is protected, all is well.
How do you figure that AI is the answer to environmental collapse? Don’t get me wrong, copyright law is stupid, but I guess I just don’t buy into all of the AI hype to the extent that others are.
I believe it will require a level and pace of informational processing that is far beyond what humans will accomplish alone. just having a system that can efficiently sift through the excess existing papers, and find correlations or contradictions would be amazing for development of new technology. if you are paying attention to any environmental sciences right now, it’s terrifying in an extremely real and tangible way. we will not outpace the collapse without an intense increase in technological development.
if we bridge the gap of analogical comprehension in these systems, they could also start introducing or suggesting technologies that could help slow down or reverse the collapse. i think this is much more important than making sure sarah silverman doesn’t have her work paraphrased.
We already know how to stop climate change, but we, as in capitalist society, do not want to.
AI is a double-edged sword. On one hand, you have an incredible piece of technology that can greatly improve the world. On the other, you have technology that can be easily misused to a disastrous degree.
I think most people can agree that an ideal world with AI is one where it is a tool to supplement innovation/research/creative output. Unfortunately, that is not the mindset of venture capitalists and technology enthusiasts. The tools are already extremely powerful, so these parties see them as replacements to actual humans/workers.
The saddest example has to be graphic designers/digital artists. It’s not some job that “anyone can do.” It’s an entire profession that takes years to master and perfect. AI replacement doesn’t just mean taking away their job, it’s rendering years of experience worthless. The frustrating thing is it’s doing all of this with their works, their art. Even with more regulations on the table, companies like Adobe and DeviantArt are still using shady practices to con unknowing users into building their AI algorithms (quietly instating automatic OPT-IN and making OPT-OUT options difficult). It’s sort of like forcing a man to dig his own grave.
You can’t blame artists for being mad about the whole situation. If you were in their same position, you would be just as angry and upset. The hard truth is that a large portion of the job market could likely be replaced by AI at some point, so it could happen to you.
These tools need to be TOOLS, not replacements. AI has its shortcomings, and expert knowledge should be used as a supplement to improve both these tools and the final product. There was a great video that covered some of those fundamental issues (such as not actually “knowing” or understanding what a certain object/concept is), but I can’t find it right now. I think the best comes when everyone is cooperating.
Even as tools, every time we increase worker productivity without a similar adjustment to wages we transfer more wealth to the top. It’s definitely time to seriously discuss a universal basic income.
Quoting this comment from the HN thread:
On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data.
While it strikes me as perfectly plausible that the Books2 dataset contains Silverman’s book, this quote from the complaint seems obviously false.
First, even if the model never saw a single word of the book’s text during training, it could still learn to summarize it from reading other summaries which are publicly available. Such as the book’s Wikipedia page.
Second, it’s not even clear to me that a model which only saw the text of a book, but not any descriptions or summaries of it, during training would even be particularly good at producing a summary.
We can test this by asking for a summary of a book which is available through Project Gutenberg (which the complaint asserts is Books1 and therefore part of ChatGPT’s training data) but for which there is little discussion online. If the source of the ability to summarize is having the book itself during training, the model should be equally able to summarize the rare book as it is Silverman’s book.
I chose “The Ruby of Kishmoor” at random. It was added to PG in 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn’t even identify the correct main characters. The GPT-4 model refuses to even try, saying it doesn’t know anything about the story and it isn’t part of its training data.
If ChatGPT’s ability to summarize Silverman’s book comes from the book itself being part of the training data, why can it not do the same for other books?
As the commenter points out, I could recreate this result using a smaller offline model and an excerpt from the Wikipedia page for the book.
Second, it’s not even clear to me that a model which only saw the text of a book, but not any descriptions or summaries of it, during training would even be particularly good at producing a summary.
Summarising stuff is literally all ML models do. It’s their bread and butter: See what’s out there and categorise into a (ridiculously) high-dimensional semantic space. Put a bit flippantly: You shouldn’t be surprised if it’s giving you the same synopsis for both Dances with Wolves and Avatar because they are indeed very similar stories, occupying the same approximate position in that space. If you don’t ask for a summary but a full screenplay it’s going to come up with random details to fill in the details it ignored while categorising, again the results will look similar if you squint right because, again, they’re at the core the same story.
It’s not even really necessary for those models to learn the concept of “summary”, only that, in a prompt, it means “write a 200 word output instead of a 20000 word one”. It will produce a longer or shorter description of that position in space, hallucinating more or fewer details. It’s really no different than police interviewing you as a witness to a car accident and having to pay attention to not prompt you wrong, including assuming that you saw certain things, or you, too, will come up with random bullshit (and believe it): It’s all a reconstructive process, generating a concrete thing from an abstract representation. There’s really no art to summary; it’s inherent in how semantic abstraction works.
You are treating publicly available information as free from copyright, which is not the case. Wikipedia content is covered by the Creative Commons Attribution-ShareAlike License 4.0. Images might be covered by different licenses. Online articles about the book are also covered by copyright unless explicitly stated otherwise.
My understanding is that the copyright applies to reproductions of the work, which this is not. If I provide a summary of a copyrighted summary of a copyrighted work, am I in violation of either copyright because I created a new derivative summary?
Aren’t summaries and reviews covered under fair use? Otherwise newspapers have been violating copyrights for hundreds of years.
Not a lawyer so I can’t be sure. To my understanding a summary of a work is not a violation of copyright because the summary is transformative (serves a completely different purpose to the original work). But you probably can’t copy someone else’s summary, because now you are making a derivative that serves the same purpose as the original.
So here are the issues with LLMs in this regard:
- LLMs have been shown to produce verbatim or almost-verbatim copies of their training data
- LLMs can’t figure out where their output came from so they can’t tell their user whether the output closely matches any existing work, and if it does what license it is distributed under
- You can argue that by its nature, an LLM is only ever producing derivative works of its training data, even if they are not the verbatim or almost-verbatim copies I already mentioned
LLMs have been shown to produce verbatim or almost-verbatim copies of their training data
That’s either overfitting, which means the training went wrong, or plain chance. Gazillions of bonkers court cases over “did the artist at some point in their life hear a particular melody” come to mind. Great. Now that that’s flanked with allegations of eidetic memory, we have reached peak capitalism.
Don’t all three of those points apply to humans?