The LLM copyright issue is more like sampling than Napster

source: https://alchetron.com/Dust-Brothers

OpenAI, and LLM developers in general who train their models on data scraped from the web without regard for copyright, are in legal jeopardy. Multiple lawsuits have been filed, and many of us are wondering how they’ll shake out. People have made many comparisons to Napster, and I think those are valid, in a way. I also think there’s a valid comparison to sampling in music.

Has anyone asked Biz Markie or the Dust Brothers for their opinions on ChatGPT? 

In my opinion, the current legal precedent for sampling is a mess. Take this with a grain of salt though, as I’m not a legal expert. I welcome criticism from experts on this. As far as I can tell, there are multiple conflicting legal opinions, and it is difficult to operate outside of the “clear everything” modus operandi, which can obviously be cumbersome (especially for amateurs) and unfair (e.g., Bitter Sweet Symphony). I’m concerned that the debate over training AI models will evolve similarly, with a patchwork of legal opinions and decisions, and specious negotiations. To me, that is a realistic worst-case scenario for AI.

That may be why business leaders in AI have been pushing the US Congress to legislate guidelines for the use of copyrighted material in training AI models. Legislation would be preferable (especially since the companies would likely get to heavily influence it) to the mess that music sampling is today.

By the way, indexers like Yahoo and Google generally obey the robots.txt standard for opting out of indexing. Do the crawlers that collect training data for LLMs respect that guidance as well?
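
For anyone who hasn’t looked at how the opt-out works mechanically, here is a minimal sketch using Python’s standard-library robots.txt parser. The robots.txt contents, the example.com URL, and the `may_fetch` helper are all made up for illustration, and the user-agent names are just examples. The catch is that robots.txt is only a request: the check below is one a well-behaved crawler runs on itself, and nothing in the standard forces it to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask one AI crawler to stay out, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-post"  # placeholder URL
    for agent in ("GPTBot", "Googlebot"):
        print(agent, "allowed:", may_fetch(ROBOTS_TXT, agent, page))
```

With this file, the crawler named in the first rule is asked to stay away while every other agent falls through to the wildcard rule. Whether a given crawler actually honors the request is entirely up to its operator.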

Tangentially related: I tried to get Bing’s GPT-4-based chat to find a quote for me today. It seemed to be working great, but when I fact-checked it, GPT-4 had completely made up the quote and the source. The book was real, but the quote wasn’t in it. I even did a text search on the book. I called GPT-4 out on the error and asked it to double-check. It admitted its error and then gave me a different page in the same book, which also didn’t have the quote. It was a complete fabrication from start to finish. It sounded good, but it wasn’t good.