Sarah Silverman Sues Maker Of ChatGPT For Copyright Infringement

DL :)@lemmy.ml · 1 year ago

givesomefucks@lemmy.world · edit-2 1 year ago

In evidence for the suit against OpenAI, the plaintiffs claim ChatGPT violates copyright law by producing a “derivative” version of copyrighted work when prompted to summarize the source.

Both filings make a broader case against AI, claiming that by definition, the models are a risk to the Copyright Act because they are trained on huge datasets that contain potentially copyrighted information

They’ve got a point.

If you ask AI to summarize something, it needs to know what it’s summarizing. Reading other summaries might be legal, but then why not just read those summaries first?

If the AI “reads” the work first, then it would have needed to pay for it. And how do you deal with that? Is a chatbot treated like one user? Or does it need to pay for a copy for each human that asks for a summary?

I think if they’d paid for a single ebook library subscription they’d be fine. However, the article says they used pirate libraries so it could read anything on the fly.

Pointing an AI at pirated media is going to be hard to defend in court. And a class action full of authors and celebrities isn’t going to be a cakewalk. They’ve got a lot of money to fight, and have lots of contacts for copyright laws. I’m sure all the publishers are pissed too.

Everyone is going after AI money these days, this seems like the rare case where it’s justified

Rivalarrival@lemmy.today · 1 year ago

If the AI “reads” the work first, then it would have needed to pay for it

That’s not actually true. Copyright applies to distribution, not consumption. You violate no law when I create an unauthorized copy of a work, and you read that copy. Copyright law prohibits you from distributing further copies, but it does not prohibit you from possessing the copy I provided you, nor are you prohibited from speaking about the copy you have acquired.

Unless the AI is regurgitating substantial parts of the original work, its output is a “transformative derivation”, which is not subject to the protections of the original copyright. The AI is doing what English teachers ask of every school-age child: create a book report.

TWeaK@lemm.ee · edit-2 1 year ago

Copyright applies to distribution, not consumption. You violate no law when I create an unauthorized copy of a work

This is completely untrue. Making any unauthorised copy is an infringement of copyright. Hell, the UK determined that merely loading a pirated game into RAM was unauthorised copying, making the act of playing a pirated game unlawful - thankfully this ruling is only the case in the UK, however the basic principles of copyright are the same all over the world.

When you buy something, you get a limited license to make copies for the purpose of viewing the material. That license does not extend to making backup copies. However, in a practical sense, it is very unlikely you will be prosecuted for most kinds of infringement like this - particularly when no money is involved. It’s still infringement, though.

Edit: I will say though: you violate no law when you view a copy I create. However I would still be infringing for making and showing you the copy.

In the case of making a book report, that is educational, and thus fair use. ChatGPT is not educational - you might use it for education, but ChatGPT’s use of copyrighted material is for commercial enterprise.

Rivalarrival@lemmy.today · 1 year ago

The uploader is the person creating the copy. Downloading is not creating a copy; downloading is receiving a copy.

I would love to see a citation on that UK precedent, but as you said: “thankfully this is only the case in the UK” and does not apply in the rest of the world.

Making any unauthorised copy is an infringement of copyright.

The exceptions to that are so numerous that the statement is closer to false than truth. “Fair Use” blows the absolute nature of that statement out of the water.

There has never been a successful prosecution for downloading only.

lobelia581@lemmy.dbzer0.com · 1 year ago

There was still copyright infringement because the company probably downloaded the text (which created another copy) and modified it (alteration is also protected by copyright) before using it as training data. If you write an original novel and admit that you had pirated a bunch of novels to use for reference, those novels were still downloaded illegally even if you’ve deleted them by now. The AI isn’t copyright infringement itself, it’s proof that copyright infringement has happened.

But personally I don’t think the actual laws will matter so much as which side has the better case for why they will lead to more innovation and growth for the economy.

Rivalarrival@lemmy.today · 1 year ago

There was still copyright infringement because the company probably downloaded the text (which created another copy)

Sure, someone likely infringed on copyright for that copy to be created, but the person/entity committing that infringement is the sender, not the receiver. The uploader is the infringing party, not the downloader.

If you write an original novel and admit that you had pirated a bunch of novels to use for reference, those novels were still downloaded illegally even if you’ve deleted them by now.

They were uploaded illegally. The people who distributed those copies to me have infringed on copyright, sure. My receiving those copies does not constitute infringement. Uploading is the illegal act, not downloading.

My work does not violate copyright, unless I use a substantial part of the other works. But, if I used substantial parts of those works, my work would be some sort of “derivation” and not the “original novel” you declared it. (Many types of derivation fall within “fair use” and do not constitute infringement.)

Whether I delete the works or not is entirely irrelevant. I am prohibited from creating and distributing additional copies, but I am not prohibited from receiving, possessing, or consuming an unauthorized copy.

lobelia581@lemmy.dbzer0.com · edit-2 1 year ago

The uploader is the infringing party, not the downloader.

an exclusive right of the copyright holder is the right to duplicate their work. downloading IS illegal because you’re creating an unauthorized duplicate of the work on your machine. your duplicate is distinct from the duplicate that someone else had created and uploaded. it’s just very hard to get caught downloading, and it’s not very cost effective for companies to pursue since they would only stop one person. that’s why most companies like the RIAA targeted torrents for their lawsuits, because they could easily see the ip addresses (which is why you should always use a vpn when torrenting) and because they could shut down uploaders. but downloading itself is still very illegal

My work does not violate copyright, unless I use a substantial part of the other works.

like I said, the AI is not a violation (probably, unless the courts later disagree), it’s proof that unauthorized duplication of copyrighted works has occurred, and that is illegal

Rivalarrival@lemmy.today · 1 year ago

You cannot create a copy of a work that you do not possess. The downloader does not possess the work to create a copy. Only the uploader is even capable of creating the copy. The downloader cannot create a copy; he can only request.

If he does something else with that copy he receives, he becomes something other than merely a downloader. That “something else” could be unlawful, but that “something else” is not “downloading”.

It could be unlawful if the downloader gains unauthorized access to the computer system, but that would not be a copyright violation. It could be unlawful if the downloader conspires with the uploader, but the degree of collaboration would have to be much greater to support a conspiracy charge.

Downloading does not meet the statutory criteria for copyright infringement. Downloading alone is not infringement.

lobelia581@lemmy.dbzer0.com · 1 year ago

US

UK

EU

Australia cuz why the hell not

ANGRY_MAPLE@sh.itjust.works · edit-2 1 year ago

The “You wouldn’t steal a car” anti-piracy ad is coming to mind lol

givesomefucks@lemmy.world · 1 year ago

They get people torrenting movies by saying you seed while you leech…

So if they torrented them en masse, they broke it.

ChaoticStupid@lemmy.world · 1 year ago

“It was like this when I got it”

limeaide@lemmy.ml · 1 year ago

Can the sources where ChatGPT got its information from be traced? What if it got the information from other summaries?

I think the hardest thing for these companies will be validating the information their AI is using. I can see an encyclopedia-like industry popping up over the next couple years.

Btw I know very little about this topic but I find it fascinating

rainroar@lemmy.ml · 1 year ago

Yes! They publish the data sources and where they got everything from. Diffusers (Stable Diffusion/Midjourney etc) and GPT both use tons of data that was taken in ways that likely violate that data’s usage agreement.

Imo they deserve whatever lawsuits they have coming.

radarsat1@lemmy.ml · 1 year ago

likely violate that data’s usage agreement.

It doesn’t seem to be too common for books to include specific clauses or EULAs that prohibit their use as data in machine learning systems. I’m curious if there are really any aspects that cover this without it being explicitly mentioned. I guess we’ll find out.

rainroar@lemmy.ml · 1 year ago

I think with a book your standard digital license / copyright would forbid it, would it not?

radarsat1@lemmy.ml · 1 year ago

Maybe. I’m interested in the specifics.

Beej Jorgensen@lemmy.sdf.org · 1 year ago

It depends on if the summary is an infringing derivative work, doesn’t it? Wikipedia is full of summaries, for example, and it’s not violating copyright.

If they illegally downloaded the works, that feels like a standalone issue to me, not having anything to do with AI.

TWeaK@lemm.ee · 1 year ago

Wikipedia is a non profit whose primary purpose is education. ChatGPT is a business venture.

Rivalarrival@lemmy.today · 1 year ago

A book review published in a newspaper is a commercial venture for the purpose of selling ads. The commercial aspect doesn’t make the review an infringement.

A summary is a “Transformative Derivation”. It is a related work, created for a fundamentally different purpose. It is a discussion about the work, not a copy of the work. Transformative derivations are not infringements, even where they are specifically intended to be used for commercial purposes.

dartos@reddthat.com · 1 year ago

I’ve noticed that the lemmy crowd seems more accepting of AI stuff than the Reddit crowd was

Aniki 🌱🌿@lemm.ee · 1 year ago

I mean for tech stuff it’s fantastic. I could spend 30 minutes working out a regex to grep the logs in the format I need or I could have a back and forth with ChatGPT and get it sorted in 5.

I still don’t want it to write my TV or movies. Or code to a significant degree.
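
For anyone curious what that kind of regex ends up looking like, here is a minimal sketch in Python. The log format is made up for illustration (date, time, level, component in brackets, then the message), and `grep_logs` is a hypothetical helper name, so adjust the pattern to whatever your logs actually look like.

```python
import re

# Hypothetical log format, assumed purely for illustration:
#   2023-07-10 14:03:22 ERROR [auth] login failed for user=alice
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<component>[^\]]+)\] "
    r"(?P<message>.*)$"
)


def grep_logs(lines, level="ERROR"):
    """Return (time, component, message) for each line at the given level."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        # Lines that do not fit the assumed format are silently skipped.
        if match and match["level"] == level:
            hits.append((match["time"], match["component"], match["message"]))
    return hits


sample = [
    "2023-07-10 14:03:22 ERROR [auth] login failed for user=alice",
    "2023-07-10 14:03:25 INFO [auth] login ok for user=bob",
    "not a log line at all",
]

for hit in grep_logs(sample):
    print(hit)
```

Named groups make the pattern longer, but they are a lot easier to check by eye than a one-liner, which matters if a chatbot wrote the first draft.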

Colonel Sanders@lemmy.world · 1 year ago

On the flip side, anytime I’ve tried to use it to write python scripts for me, it always seems to get them slightly wrong. Nothing that a little troubleshooting can’t handle, and certainly helps to get me in the ballpark of what I’m looking for, but I think it still has a little ways to go for specific coding use cases.

TheSaneWriter@lemm.ee · 1 year ago

I think the key there is that ChatGPT isn’t able to run its own code, so all it can do is generate code which “looks” right, which in practice is close to functional but not quite. In order for the code it writes to reliably work, I think it would need a builtin interpreter/compiler to actually run the code, and for it to iterate constantly making small modifications until the code runs, then return the final result to the user.
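
Roughly, the loop described here could be sketched like this. This is only an illustration under assumptions, not how ChatGPT works: `fake_model` is a hypothetical stand-in for the model, and `run_candidate` and `iterate_until_runs` are made-up helper names. Note that "runs without an error" is a much weaker bar than "does what the user wanted".

```python
import subprocess
import sys


def run_candidate(source, timeout=5):
    """Run a candidate script in a fresh interpreter; return (ok, stderr)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    return result.returncode == 0, result.stderr


def iterate_until_runs(generate, max_attempts=3):
    """Ask `generate` for code, feeding back the last error, until it runs."""
    error = None
    for attempt in range(1, max_attempts + 1):
        source = generate(error)
        ok, error = run_candidate(source)
        if ok:
            return source, attempt
    return None, max_attempts


def fake_model(last_error):
    # Hypothetical stand-in for a model: wrong on the first try,
    # corrected once it has seen an error message.
    if last_error is None:
        return "print(undefined_name)"
    return "print('hello')"


if __name__ == "__main__":
    print(iterate_until_runs(fake_model))
```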

icosahedron@ttrpg.network · 1 year ago

The new code interpreter is able to run its own code, but i haven’t personally tested it to see if its code is more often functional.

Sheltac@lemmy.world · 1 year ago

It can even deal with basic algebra, it’s awesome. I can’t be fucked to work out this 16-var linear system, or even to write out the sympy to do it.

But guess who is?
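
For reference, the kind of thing being handed off looks roughly like the sketch below. It is not sympy: it swaps in plain Gauss-Jordan elimination with exact fractions from the standard library, and `solve_linear_system` is a hypothetical helper name. The same code handles 2 variables or 16, as long as the system has exactly one solution.

```python
from fractions import Fraction


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination using exact fractions."""
    n = len(a)
    # Build the augmented matrix [a | b] with exact rational entries.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system has no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


# 2x + y = 5 and x - y = 1
print(solve_linear_system([[2, 1], [1, -1]], [5, 1]))
```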

TWeaK@lemm.ee · 1 year ago

I for one welcome our SkyNet overlords. They can’t be much worse than the current global leaders…

gamer@lemm.ee · 1 year ago

That’s genius! I’ve been trying to figure out how to incorporate ChatGPT-like bots into my work, but haven’t found it to be that useful. I don’t write a lot of regex, but hate it every time I do, so I’ll definitely be trying this next time I need it.

throwsbooks@lemmy.world · 1 year ago

It’s probably related to the fact that it seems a lot of Lemmy users are in tech, rather than art.

I think generative AI is a great tool, but a lot of people who don’t understand how it works either overestimate (it can do everything and it’s so smart!!) or underestimate it (all it does is steal my work!!)

throwsbooks@lemmy.world · 1 year ago

Personally, I’m a comp sci graduate who did several courses exploring AI, but I actually started out in fine arts and continue to paint, write, and play music to this day. I’m sure I’ll be blending these studies in some way when I move on to my master’s.

I agree that automation is scary. It’s unregulated. But it’s not the tech so much that’s evil, but rather the employers who see it as a reason to get rid of employees. And before, it’d be manual labour that we replaced with machines. People doing mental labour thought they were immune, until now they’re not. Our economic system’s going to need to change in some way.

But generative AI can be very good even for artists. For example, sometimes I suffer from writer’s block (who doesn’t?). Now, I can feed what I’m working on into chatGPT and have it spit out an example of the next paragraph. Sometimes that’s enough to spur me on so I can write the next page.

Artist movements in general are pretty conservative. When digital painting first became a thing, allowing people to use layers and filters so easily, the kneejerk reaction by artists was to consider it cheating.

My hope is that in an ideal world, human-made art becomes valuable in the future precisely because it has the human touch. Live music played on real instruments, paintings on canvas, the sorts of things with quirks and imperfections and a human element that can’t be mass produced. Let the corporations have their algorithmic, soulless advertisements, and let the people focus on true self expression.

But then for people without artistic talent, say those who want to make indie games but can’t hire an artist or a musician because they’re just some kid with a dream and little experience? Hell, why not let them generate some assets with AI?

But we need to make sure that people aren’t afraid of becoming homeless, starving on the streets. I think, we’re not getting rid of AI at this point, it’s too powerful, and I don’t have an answer to our societal problems. For better or worse, we’ll adapt.

surrendertogravity@wayfarershaven.eu · 1 year ago

I appreciate this point of view! My BA is in visual arts, but I’ve also leaned heavily into tech, programming as a hobby, etc.

I think there’s a lot of different topical threads at play when it comes to AI art (classism and fine art, what average viewers vs trained viewers find appealing in a visual medium, etc) – but the economic issues that you point out are really key. Many artists rely on their craft for their literal bodily survival, so AI art is very much a real threat to them.

But when I first interacted with Midjourney, and saw my mom (just an average lady) get excited about AI-generated art, I couldn’t help but see it like photography – all of a sudden the average person gets access to a way of visually capturing things that make them happy, that they think look cool, something they saw in a dream but didn’t have the skill to create visually… and that doesn’t sound like an inherently bad thing to me.

Holodeck_Moriarty@lemm.ee · 1 year ago

I just think it’s awesome technology and that we shouldn’t be holding it back. AI is pandora’s box, and that box can’t be closed now that it’s open.

All these attempts to restrict it remind me of the old efforts to stop people from taping TV shows with their VCRs.

Asafum@lemmy.world · 1 year ago

I feel like when confronted about a “stolen comedy bit” a lot of these people complaining would also argue that “no work is entirely unique, everyone borrows from what already existed before.” But now they’re all coming out of the woodwork for a payday or something… It’s kinda frustrating especially if they kill any private use too…

TheyHaveNoName@lemmy.fmhy.ml · 1 year ago

I’m a teacher and the last half of this school year was a comedy of my colleagues trying to “ban” chat GPT. I’m not so much worried about students using chat GPT to do work. A simple two minute conversation with a student who creates an excellent (but suspected) piece of writing will tell you whether they wrote it themselves or not. What worries me is exactly those moments where you’re asking for a summary or a synopsis of something. You really have no idea what data is being used to create that summary.

BedbugCutlefish@lemmy.world · edit-2 1 year ago

The issue isn’t that people are using others works for ‘derivative’ content.

The issue is that, for a person to ‘derive’ comedy from Sarah Silverman the ‘analogue’ way, you have to get her works legally, be that streaming her comedy specials, or watching movies/shows she’s written for.

With ChatGPT and other AI, it’s been ‘trained’ on her work (and, presumably, as many others’ works as possible) once, and now there’s no ‘views’, or even sources given, to those properties.

And like a lot of digital work, its reach and speed is unprecedented. Like, previously, yeah, of course you could still ‘derive’ from people’s works indirectly, like from a friend that watched it and recounted the ‘good bits’, or through general ‘cultural osmosis’. But that was still limited by the speed of humans, and of culture. With AI, it can happen a functionally infinite number of times, nearly instantly.

Is all that to say Silverman is 100% right here? Probably not. But I do think that the legality of ChatGPT, and other AI that can ‘copy’ artists’ work, is worth questioning. But it’s a sticky enough issue that I’m genuinely not sure what the best route is. Certainly, I think current AI writing and image generation ought to be ineligible for commercial use until the issue has at least been addressed.

azuth@lemmy.world · 1 year ago

The issue is that, for a person to ‘derive’ comedy from Sarah Silverman the ‘analogue’ way, you have to get her works legally, be that streaming her comedy specials, or watching movies/shows she’s written for.

Damn did they already start implanting DRM bio-chips in people?

And like a lot of digital work, its reach and speed is unprecedented. Like, previously, yeah, of course you could still ‘derive’ from people’s works indirectly, like from a friend that watched it and recounted the ‘good bits’, or through general ‘cultural osmosis’.

Please explain why you cannot download a movie/episode/ebook illegally and then directly derive from it.

Rivalarrival@lemmy.today · 1 year ago

Please explain why you cannot download a movie/episode/ebook illegally and then directly derive from it.

The law does not prohibit the receiving of an unauthorized copy. The law prohibits the distribution of the unauthorized copy. It is possible to send/transmit/upload a movie/episode/ebook illegally, but the act of receiving/downloading that unauthorized copy is not prohibited and not illegal.

You can’t illegally download a movie/episode/ebook for the same reason that you can’t illegally park your car in your own garage: there is no law making it illegal.

Even if ChatGPT possesses an unauthorized copy of the work, it would only violate copyright law if it created and distributed a new copy of that work. A summary of the work would be considered a “transformative derivation”, and would fall well within the boundaries of fair-use.

BedbugCutlefish@lemmy.world · edit-2 1 year ago

I mean, you can do that, but that’s a crime.

Which is exactly what Sarah Silverman is claiming ChatGPT is doing.

And, beyond an individual crime of a person reading a pirated book, again, we’re talking about ChatGPT and other AI magnifying reach and speed, beyond what an individual person ever could do even if they did nothing but read pirated material all day, not unlike websites like The Pirate Bay. Y’know, how those websites constantly get taken down and have to move around the globe to areas where they’re beyond the reach of the law, due to the crimes they’re doing.

I’m not like, anti-piracy or anything. But also, I don’t think companies should be using pirated software, and my big concern about LLMs isn’t really private use, but corporate use.

azuth@lemmy.world · 1 year ago

I mean, you can do that, but that’s a crime.

Consuming content illegally is by definition a crime, yes. It also has no effect on your output. A summary or review of that content will not be infringing, it will still be fair use.

A more substantial work inspired by that content could be infringing or not depending on how close it is to the original content but not on the legality of your viewing of that content.

Nor is it relevant. If you have any success with your copy you are going to cause way more damage to the original creator than pirating one copy.

And, beyond an individual crime of a person reading a pirated book, again, we’re talking about ChatGPT and other AI magnifying reach and speed, beyond what an individual person ever could do even if they did nothing but read pirated material all day, not unlike websites like The Pirate Bay. Y’know, how those websites constantly get taken down and have to move around the globe to areas where they’re beyond the reach of the law, due to the crimes they’re doing.

I can assure you that The Pirate Bay is quite stable. I would like to point out that none of the AI vendors has actually been convicted of copyright infringement yet. That their use is infringing and a crime is your opinion.

It’s also going to be irrelevant because there are companies that do own massive amounts of copyrighted materials and will be able to train their own AIs, both to sell as a service and to cut down on labor costs of creating new materials. There are also companies that got people to agree to licensing their content for AI training, such as Adobe.

So copyright law will not be able to help creators. So there will be a push for more laws and regulators. Depending on what they manage to push through you can forget non major corp backed AI, reduced fair use rights (as in unapproved reviews being de-facto illegal) and perhaps a new push against software that could be used for piracy such as non-regulated video or music players, nevermind encoders etc.

BedbugCutlefish@lemmy.world · 1 year ago

Consuming content illegally is by definition a crime, yes. It also has no effect on your output. A summary or review of that content will not be infringing, it will still be fair use.

That their use is infringing and a crime is your opinion.

“My opinion”? Have you read the headline? It’s not my opinion that matters, it’s that of the prosecution in this lawsuit. And this lawsuit indeed alleges that copyright infringement has occurred; it’ll be up to the courts to see if the claim holds water.

I’m definitely not sure that GPT4 or other AI models are copyright infringing or otherwise illegal. But, I think that there’s enough that seems questionable that a lawsuit is valid to do some fact-finding, and honestly, I feel like the law is a few years behind on AI anyway.

But it seems plausible that the AI could be found to be ‘illegally distributing works’, or otherwise have broken IP laws at some point during their training or operation. A lot depends on what kind of agreements were signed over the contents of the training packages, something I frankly know nothing about, and would like to see come to light.

azuth@lemmy.world · 1 year ago

“My opinion”? Have you read the headline? It’s not my opinion that matters, it’s that of the prosecution in this lawsuit. And this lawsuit indeed alleges that copyright infringement has occurred; it’ll be up to the courts to see if the claim holds water.

No, the opinion that matters is the opinion of the judge. Before we have a decision, there is no copyright infringement.

I’m definitely not sure that GPT4 or other AI models are copyright infringing or otherwise illegal. But, I think that there’s enough that seems questionable that a lawsuit is valid to do some fact-finding

You sure speak as if you do.

and honestly, I feel like the law is a few years behind on AI anyway.

But it seems plausible that the AI could be found to be ‘illegally distributing works’, or otherwise have broken IP laws at some point during their training or operation. A lot depends on what kind of agreements were signed over the contents of the training packages, something I frankly know nothing about, and would like to see come to light.

I’ve said in my previous post that copyright will not solve the problems you describe as the law being behind on AI. Considering how the laws regarding copyright ‘caught up with the times’ in the beginning of the internet… I am not optimistic the changes will be beneficial to society.

Rivalarrival@lemmy.today · 1 year ago

Consuming content illegally is by definition a crime, yes.

What law makes it illegal to consume an unauthorized copy of a work?

That’s not a flippant question. I am being absolutely serious. Copyright law prohibits the creation and distribution of unauthorized copies; it does not prohibit the reception, possession, or consumption of those copies. You can only declare content consumption to be “illegal” if there is actually a law against it.

azuth@lemmy.world · 1 year ago

What law makes it illegal to consume an unauthorized copy of a work?

That’s not a flippant question. I am being absolutely serious. Copyright law prohibits the creation and distribution of unauthorized copies; it does not prohibit the reception, possession, or consumption of those copies. You can only declare content consumption to be “illegal” if there is actually a law against it.

Which legal system?

Rivalarrival@lemmy.today · 1 year ago

The issue is that, for a person to ‘derive’ comedy from Sarah Silverman the ‘analogue’ way, you have to get her works legally,

That is not actually true.

I would violate copyright by making an unauthorized copy and providing it to you, but you do not violate copyright for simply viewing that unauthorized copy. Sarah can come after me for creating the cop[y|ies], but she can’t come after the people to whom I send them, even if they admit to having willingly viewed a copy they knew to be unauthorized.

Copyright applies to distribution, not consumption.

barsoap@lemm.ee · 1 year ago

The issue is that, for a person to ‘derive’ comedy from Sarah Silverman the ‘analogue’ way, you have to get her works legally, be that streaming her comedy specials, or watching movies/shows she’s written for.

I can also talk to a guy in a bar rambling about her work. That guy’s name? ChatGPT.

Sparky678348@lemm.ee · 1 year ago

I know this is kind of a silly argument but storing protected work in our own human memories to recall later is certainly not reproduction.

I don’t think it’s reproduction for chat GPT to file away that information to call on it later. It’s just better at it than we are.

TheSaneWriter@lemm.ee · 1 year ago

If the models were trained on pirated material, the companies here have stupidly opened themselves to legal liability and will likely lose money over this, though I think they’re more likely to settle out of court than lose. In terms of AI plagiarism in general, I think that could be alleviated if an AI had a way to cite its sources, i.e. point back to where in its training data it obtained information. If AI cited its sources and did not word for word copy them, then I think it would fall under fair use. If someone then stripped the sources out and paraded the work as their own, then I think that would be plagiarism again, where that user is plagiarizing both the AI and the AI’s sources.

ayaya@lemmy.fmhy.ml · edit-2 1 year ago

It is impossible for an AI to cite its sources, at least in the current way of doing things. The AI itself doesn’t even know where any particular text comes from. Large language models are essentially really complex word predictors, they look at the previous words and then predict the word that comes next.

When it’s training it’s putting weights on different words and phrases in relation to each other. If one source makes a certain weight go up by 0.0001% and then another does the same, and then a third makes it go down a bit, and so on-- how do you determine which ones affected the outcome? Multiply this over billions if not trillions of words and there’s no realistic way to track where any particular text is coming from unless it happens to quote something exactly.

And if it did happen to quote something exactly, which is basically just random chance, the AI wouldn’t even be aware it was quoting anything. When it’s running it doesn’t have access to the data it was trained on, it only has the weights on its “neurons.” All it knows are that certain words and phrases either do or don’t show up together often.
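
A toy version of the idea, for anyone who wants to poke at it. Big caveat: real LLMs are neural networks with learned weights, not word-pair counts, so this is only a sketch of the "pooled weights, no record of the source" point. The three sample sentences and the `train` and `predict_next` names are made up for illustration.

```python
from collections import Counter, defaultdict


def train(sources):
    """Count which word follows which, pooling every source together."""
    weights = defaultdict(Counter)
    for text in sources:
        words = text.lower().split()
        for prev, nxt in zip(words, words[1:]):
            # The count goes up, but nothing records which source caused it.
            weights[prev][nxt] += 1
    return weights


def predict_next(weights, word):
    """Return the most frequently seen follower of `word`, or None."""
    followers = weights.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


sources = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "a dog sat on the rug",
]
weights = train(sources)

print(predict_next(weights, "the"))
print(dict(weights["sat"]))
```

After training, `weights["sat"]` only says that "on" followed "sat" twice. Two different sentences contributed to that number, and the finished table has no way to say which ones.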

Zetaphor@zemmy.cc · edit-2 1 year ago

Quoting this comment from the HN thread:

On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data.

While it strikes me as perfectly plausible that the Books2 dataset contains Silverman’s book, this quote from the complaint seems obviously false.

First, even if the model never saw a single word of the book’s text during training, it could still learn to summarize it from reading other summaries which are publicly available. Such as the book’s Wikipedia page.

Second, it’s not even clear to me that a model which only saw the text of a book, but not any descriptions or summaries of it, during training would even be particular good at producing a summary.

We can test this by asking for a summary of a book which is available through Project Gutenberg (which the complaint asserts is Books1 and therefore part of ChatGPT’s training data) but for which there is little discussion online. If the source of the ability to summarize is having the book itself during training, the model should be equally able to summarize the rare book as it is Silverman’s book.

I chose “The Ruby of Kishmoor” at random. It was added to PG in 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn’t even identify the correct main characters. The GPT-4 model refuses to even try, saying it doesn’t know anything about the story and it isn’t part of its training data.

If ChatGPT’s ability to summarize Silverman’s book comes from the book itself being part of the training data, why can it not do the same for other books?

As the commenter points out, I could recreate this result using a smaller offline model and an excerpt from the Wikipedia page for the book.

patatahooligan@lemmy.world · 1 year ago

You are treating publicly available information as free from copyright, which is not the case. Wikipedia content is covered by the Creative Commons Attribution-ShareAlike License 4.0. Images might be covered by different licenses. Online articles about the book are also covered by copyright unless explicitly stated otherwise.

Zetaphor@zemmy.cc · edit-2 1 year ago

My understanding is that the copyright applies to reproductions of the work, which this is not. If I provide a summary of a copyrighted summary of a copyrighted work, am I in violation of either copyright because I created a new derivative summary?

patatahooligan@lemmy.world · 1 year ago

Not a lawyer so I can’t be sure. To my understanding a summary of a work is not a violation of copyright because the summary is transformative (serves a completely different purpose to the original work). But you probably can’t copy someone else’s summary, because now you are making a derivative that serves the same purpose as the original.

So here are the issues with LLMs in this regard:

LLMs have been shown to produce verbatim or almost-verbatim copies of their training data
LLMs can’t figure out where their output came from so they can’t tell their user whether the output closely matches any existing work, and if it does what license it is distributed under
You can argue that by its nature, an LLM is only ever producing derivative works of its training data, even if they are not the verbatim or almost-verbatim copies I already mentioned

barsoap@lemm.ee · edit-2 1 year ago

LLMs have been shown to produce verbatim or almost-verbatim copies of their training data

That’s either overfitting and means the training went wrong, or plain chance. Gazillions of bonkers court cases over “did the artist at some point in their life hear a particular melody” come to mind. Great. Now that that’s flanked with allegations of eidetic memory, we have reached peak capitalism.

Banzai51@midwest.social · 1 year ago

Aren’t summaries and reviews covered under fair use? Otherwise newspapers have been violating copyrights for hundreds of years.

barsoap@lemm.ee · 1 year ago

Second, it’s not even clear to me that a model which only saw the text of a book, but not any descriptions or summaries of it, during training would even be particularly good at producing a summary.

Summarising stuff is literally all ML models do. It’s their bread and butter: See what’s out there and categorise into a (ridiculously) high-dimensional semantic space. Put a bit flippantly: You shouldn’t be surprised if it’s giving you the same synopsis for both Dances with Wolves and Avatar because they are indeed very similar stories, occupying the same approximate position in that space. If you don’t ask for a summary but a full screenplay it’s going to come up with random details to fill in the details it ignored while categorising, again the results will look similar if you squint right because, again, they’re at the core the same story.

It’s not even really necessary for those models to learn the concept of “summary”, only that, in a prompt, it means “write a 200 word output instead of a 20000 word one”. It will produce a longer or shorter description of that position in space, hallucinating more or fewer details. It’s really no different than police interviewing you as a witness to a car accident and having to pay attention to not prompt you wrong, including assuming that you saw certain things, or you, too, will come up with random bullshit (and believe it): It’s all a reconstructive process, generating a concrete thing from an abstract representation. There’s really no art to summary; it’s inherent in how semantic abstraction works.
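To make the “same approximate position in that space” idea concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the four plot axes and all the numbers are made up by hand, and the `cosine_similarity` helper is just the textbook formula. Real models learn thousands of dimensions on their own and nobody labels them.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-made axes (NOT learned by any model):
# [outsider joins native people, military antagonist, romance, courtroom setting]
stories = {
    "Dances with Wolves": [1.0, 0.9, 0.8, 0.0],
    "Avatar": [1.0, 1.0, 0.7, 0.0],
    "some courtroom drama": [0.0, 0.0, 0.0, 1.0],
}

sim_close = cosine_similarity(stories["Dances with Wolves"], stories["Avatar"])
sim_far = cosine_similarity(stories["Dances with Wolves"], stories["some courtroom drama"])

# The two "same story" films land almost on top of each other,
# while the unrelated one points in a completely different direction.
print(f"Dances with Wolves vs Avatar: {sim_close:.3f}")
print(f"Dances with Wolves vs courtroom drama: {sim_far:.3f}")
```

A synopsis generated from either of the two nearby points would come out much the same, which is the flippant point above.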

Riptide502@lemm.ee · 1 year ago

AI is a dual-sided blade. On one hand, you have an incredible piece of technology that can greatly improve the world. On the other, you have technology that can be easily misused to a disastrous degree.

I think most people can agree that an ideal world with AI is one where it is a tool to supplement innovation/research/creative output. Unfortunately, that is not the mindset of venture capitalists and technology enthusiasts. The tools are already extremely powerful, so these parties see them as replacements to actual humans/workers.

The saddest example has to be graphic designers/digital artists. It’s not some job that “anyone can do.” It’s an entire profession that takes years to master and perfect. AI replacement doesn’t just mean taking away their job, it’s rendering years of experience worthless. The frustrating thing is it’s doing all of this with their works, their art. Even with more regulations on the table, companies like Adobe and DeviantArt are still using shady practices to con unknowing users into building their AI algorithms (quietly instating automatic OPT-IN and making OPT-OUT options difficult). It’s sort of like forcing a man to dig his own grave.

You can’t blame artists for being mad about the whole situation. If you were in their same position, you would be just as angry and upset. The hard truth is that a large portion of the job market could likely be replaced by AI at some point, so it could happen to you.

These tools need to be TOOLS, not replacements. AI has its downfalls and expert knowledge should be used as a supplement to both improve these tools and the final product. There was a great video that covered some of those fundamental issues (such as not actually “knowing” or understanding what a certain object/concept is), but I can’t find it right now. I think the best comes when everyone is cooperating.

RoundSparrow @ .ee@lemm.ee · 1 year ago

The comic’s suit questions if AI models can function without training themselves on protected works.

I doubt a human can compose chat responses without having trained at school on previous language. Copyright favors the rich, powerful and established, like Silverman.

trachemys@lemmy.world · 1 year ago

We are overdue for strengthening fair use.

Rivalarrival@lemmy.today · 1 year ago

Indeed.

Possession of a copyrighted work should never be considered infringement. The fact that a book is floating around in a mind must not be considered infringement no matter how it got into that mind, nor whether that mind is biological or artificially constructed.

Until that work comes back out of that mind in substantially identical form as to how it went in, it cannot be considered copyright infringement.

patatahooligan@lemmy.world · 1 year ago

Selectively breaking copyright laws specifically to allow AI models also favors the rich, unfortunately. These models will make a very small group of rich people even richer while putting out of work the millions of creators whose works were stolen to train the models.

TheSaneWriter@lemm.ee · 1 year ago

To be fair, in most capitalist nations, literally any decision made will favor the rich because the system is automatically geared that way. I don’t think the solution is trying to come up with more jobs or prevent new technology from emerging in order to preserve existing jobs, but rather to retool our social structure so that people are able to survive while working less.

patatahooligan@lemmy.world · 1 year ago

Oh no, rich assholes who continuously lobby for strict copyright and patent laws in order to suffocate competition might find themselves restricted by it for once. Quick, find me the world’s smallest violin!

No, if you want AI to emerge, argue in favor of relaxing copyright law in all cases, not specifically to allow AI to copyright launder other people’s works.

vlad@lemmy.sdf.org · 1 year ago

I was under the impression that there was no real definitive way to tell what ChatGPT or similar AI use for their training. Am I wrong?

NevermindNoMind@lemmy.world · 1 year ago

Yes, it’s in the lawsuit and another article I read. OpenAI said they used a specific dataset, and the makers of that dataset said they used some online open libraries which have full texts of books. That’s the primary basis of the lawsuit. They also argue that if you ask ChatGPT for a summary of their books, it will spit one out, which they are claiming is misuse of their copyrighted work. That claim sounds dicey to me. Wikipedia and all manner of websites summarize books, so I’m not following how ChatGPT doing it is different. But I’m an idiot so who cares what I think.

hurp_mcderp@lemmy.ml · edit-2 1 year ago

Remember, the human that wrote a summary had to legally obtain a copy of the source material first too. It should be no different when training an AI model. There’s a whole new can of worms here, though, since the summary was written by another person and that person holds the copyright to that summary (unless there is a substantial amount of the original material, of course). But an AI model is not “creating” a new, copyrightable work. It has to be trained on the entire source material and algorithmically creates a summary directly from that. Because there’s nothing ‘new’ being created, I can see why it could be claimed that a summary from an AI model should be considered a derivative work. But honestly, it’s starting to border on the question of whether or not what AI models can do is considered ‘creative thinking’. Shit’s getting wild.

vlad@lemmy.sdf.org · 1 year ago

I care. Idiots unite!

5 Card Draw@lemmy.fmhy.ml · 1 year ago

Copyright laws are a recent phenomenon and should have never been a thing imo. The only reason it’s there is not to “protect creators,” but to make sure upper classes extract as much wealth over the maximum amount of time possible.

Music piracy has shown that it’s got too many holes in it to be effective, and now AI is showing us its redundancy as it uses data to give better results.

It stifles creativity to the point it makes us inhuman. Hell, Chinese writers used to praise others if they used a line or two from other writers.

TheSaneWriter@lemm.ee · 1 year ago

I think that copyright laws are fine in a vacuum, but that if nothing else we should review the amount of time before a copyrighted work enters the public domain. Disney lobbied to have it set to something awful like 100 years, and I think it should almost certainly be shorter than that.

Marxine@lemmy.ml · 1 year ago

VC-backed AI makers and billionaire-run corporations should definitely pay for the data they use to train their models. The common user should definitely check the licences of the data they use as well.

SixTrickyBiscuits@lemmy.world · 1 year ago

That is essentially impossible. How are they going to pay each Reddit user whose comment the AI analyzed? Or each website it analyzed? We’re talking about terabytes of text data taken from a huge variety of sources.

CannaVet@lemmy.world · edit-2 1 year ago

Then it should be treated as what it is, an illegal venture based off of theft. I don’t get a legal pass to steal just because the groceries I stole got cooked into a meal and are therefore no longer the groceries I stole.

azuth@lemmy.world · 1 year ago

Firstly, copyright infringement is not theft. It’s not theft because the grocer still has the groceries. It is a lesser crime which obviously hurts the victim less, if at all in some cases.

A summary is also not copyright infringement, it’s fair use. Of course copyright holders would love to copyright strike bad reviews (they already do even though it’s not illegal).

Marxine@lemmy.ml · 1 year ago

Billionaires can spend and burn their whole net worth for all I care. Datasets should be either:

Paid for to the provider platform, and each original content creator gets a share (e.g. the platform keeps 10% of the sale price for hosting costs, and the remaining 90% is distributed to content creators according to the size and quality of the data provided)
Consciously donated by the content creators (e.g. an OPT-IN term in the platform about donating agreed-upon data for non-profit research), but the dataset must never be sold or used for profit. Publicly available research purposes only.
Dataset is “rented” by the users and platform in an OPT-IN manner, and they receive royalties/payments for each purchase/usage of the dataset.
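The first option is basically arithmetic, so here is a rough Python sketch of how a single sale could be split. The function name `split_dataset_revenue`, the use of whole cents, and the single `weight` number per creator (standing in for “size and quality”, however a platform would actually score that) are all assumptions for illustration, not a description of any real platform.

```python
def split_dataset_revenue(sale_cents, weights, platform_percent=10):
    """Split one dataset sale between the platform and its content creators.

    sale_cents: sale price in whole cents (integers avoid float rounding).
    weights: dict of creator name -> positive integer weight (hypothetical
             score combining size and quality of the data they provided).
    platform_percent: share kept by the platform for hosting costs.
    Returns (platform_cut_cents, {creator: payout_cents}).
    """
    if not weights:
        raise ValueError("at least one creator is required")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    platform_cut = sale_cents * platform_percent // 100
    creator_pool = sale_cents - platform_cut

    # Each creator gets a share of the pool proportional to their weight,
    # rounded down to a whole cent.
    payouts = {
        name: creator_pool * weight // total_weight
        for name, weight in weights.items()
    }

    # Assumption: cents lost to rounding down stay with the platform,
    # so the parts always add up to the sale price.
    platform_cut += creator_pool - sum(payouts.values())
    return platform_cut, payouts


platform_cut, payouts = split_dataset_revenue(
    sale_cents=10_000,                 # a $100.00 sale
    weights={"alice": 3, "bob": 1},    # made-up creators and weights
)
print(platform_cut, payouts)
```

Using integer cents and handing the rounding remainder to one party is just one way to keep the books balanced; a real scheme would have to decide who absorbs those leftover cents.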

The way things are currently done only favours venture capitalists (wage thieves), shareholders (also wage thieves) and billionaire C-suits (wage thieves as well).

Maslo@lemmy.world · 1 year ago

I can’t really take seriously any accusations coming from Sarah Silverman after that whole wage gap bs she tried to pull.

Seems like she isn’t afraid to manipulate a trending social outcry to collect a paycheck.

Max_Power@feddit.de · edit-2 1 year ago

I like her and I get why creatives are panicking because of all the AI hype.

However:

In evidence for the suit against OpenAI, the plaintiffs claim ChatGPT violates copyright law by producing a “derivative” version of copyrighted work when prompted to summarize the source.

A summary is not a copyright infringement. If anything is a case for fair use, it’s a summary.

The comic’s suit questions if AI models can function without training themselves on protected works.

A language model does not need to be trained on the text it is supposed to summarize. She clearly does not know what she is talking about.

IANAL though.

WarmSoda@lemm.ee · 1 year ago

The plaintiffs claim ChatGPT violates copyright law by producing a “derivative” version of copyrighted work when prompted to summarize the source.

That’s an interesting angle. All these lawsuits are good for shaking the dirt around these things. They should be tested in the real world before they become lost in the background of everyday life.

We do already have a defense against these programs to stop them from scraping a site. I asked ChatGPT once how it gets around captchas on websites, and it told me that if there is one then it just doesn’t go any further.

Whether that’s actually true or not is another question though.

Sagrotan@lemmy.world · 1 year ago

Like the record labels sued every music sharing platform in the early days. Adapt. They’re all afraid of new things but in the end nobody can stop it. Think, learn, work with it, not against it.

diskmaster23@lemmy.one · 1 year ago

I think it’s valid. This isn’t about the tech, but the sources of your work.

Sagrotan@lemmy.world · 1 year ago

Of course it’s valid. And the misuse of AI has to be fought. Nevertheless we have to think differently in the face of something we cannot stop in the long run. You cannot create a powerful tool and only misuse it. I miscommunicated here, should’ve explained myself, I got no excuses, maybe one: I sat on the shitter and wanted to make things short.