AI Author Content Scraping: Piracy or Fair Use?
Meta, OpenAI, Anthropic, and others allegedly pirated millions of authors' works. Find out if your books and writings are affected and what you can do about it.
Here’s the TL;DR version:
Meta, OpenAI, Anthropic, and other AI platforms allegedly pirated millions of authors' works.
Several leading authors filed class action lawsuits over the use of their books.
Learn if your books and writings are affected and what you can do about it.
AI Pirating’s ‘Unbelievable Scale’
I first learned of this issue from well-known marketing author David Meerman Scott. He posted on Facebook: “Meta pirated 53 of my books and stories to train their flagship AI model, Llama 3.”
He cited an article in The Atlantic by Alex Reisner, “The Unbelievable Scale of AI’s Pirated-Books Problem,” which said that, based on recently released court documents, Meta employees downloaded and used the Library Genesis (LibGen) database, one of the largest pirated libraries available online.
The article states:
“Meta employees spoke with multiple companies about licensing books and research papers, but they weren’t thrilled with their options. This ‘seems unreasonably expensive,’ wrote one research scientist on an internal company chat, in reference to one potential deal, according to court records. A Llama-team senior manager added that this would also be an ‘incredibly slow’ process: ‘They take like 4+ weeks to deliver data.’”
It continues:
“Meta employees turned their attention to Library Genesis, or LibGen, one of the largest pirated libraries that circulate online.”
The Atlantic says that the Meta team eventually got permission from “MZ”—an apparent reference to Meta CEO Mark Zuckerberg—to download and use the dataset.
The magazine published a search tool that allows authors to check if their works are in LibGen. I was curious to learn whether LibGen included any of my books and, to my chagrin, found that it did.
Two other books, “Realty Blogging” (co-authored with Richard Nacht) and “The Social Commerce Handbook” (co-authored with Dr. Paul Marsden), were also included.
Two Lawsuits Have Been Filed
According to The Authors Guild (AG), attorneys filed a class action lawsuit on behalf of the Guild, John Grisham, Jodi Picoult, David Baldacci, George R.R. Martin, and 13 other authors. AG states:
“The Authors Guild and 17 authors filed a class-action suit against OpenAI in the Southern District of New York for copyright infringement of their works of fiction on behalf of a class of fiction writers whose works have been used to train GPT.
“The named plaintiffs include David Baldacci, Mary Bly, Michael Connelly, Sylvia Day, Jonathan Franzen, John Grisham, Elin Hilderbrand, Christina Baker Kline, Maya Shanbhag Lang, Victor LaValle, George R.R. Martin, Jodi Picoult, Douglas Preston, Roxana Robinson, George Saunders, Scott Turow, and Rachel Vail.”
AG mentions another suit, filed in California, saying, “If Meta used your book, you’re automatically included in the Kadrey v. Meta class action in Northern California without needing to take any immediate action.”
Actions You Can Take
First, search the database to see if LibGen includes your books (or other writings). If you find that’s the case, you are not without recourse. AG encourages the following actions:
Send a formal notice: If your books are in the LibGen dataset, send a letter to Meta and other AI companies stating they do not have the right to use them. Here is a template you can use.
Join the Authors Guild: You could join the Guild and support its joint advocacy efforts to ensure that the writing profession remains alive and vibrant in the age of AI. They give authors a voice, and there is power in numbers. They can also help ensure that your contracts protect you against unwanted AI use of your work.
Protect your works: Add a “NO AI TRAINING” notice on the copyright page of your works. For work you publish online, you can also update your website’s robots.txt file to block AI bots from crawling it. The Authors Guild offers practical resources to help shield your content from AI scrapers.
Get Human Authored certification: Distinguish your work in an increasingly AI-saturated market with the Authors Guild’s certification program. This visible mark verifies that a human created your book, not AI.
Stay informed: Sign up for the free Guild biweekly newsletter to keep updated on lawsuits and legislation that could impact you and your rights. “The legal landscape is changing rapidly, and we are keeping close watch,” AG says.
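For the robots.txt step mentioned above, here is a minimal sketch in Python of what such rules might look like and how to check them before uploading the file. The crawler names in the list (GPTBot, ClaudeBot, CCBot, Google-Extended) and the example.com URL are my assumptions for illustration, not something the Authors Guild specifies; check each company's documentation for the user-agent strings it currently uses.

```python
from urllib.robotparser import RobotFileParser

# Crawler names below are assumptions for illustration; check each AI
# company's documentation for the user-agent strings it currently uses.
AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows the listed agents site-wide."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    # Everyone else (for example, ordinary search engines) stays allowed.
    blocks.append("User-agent: *\nAllow: /")
    return "\n\n".join(blocks) + "\n"


def is_blocked(robots_txt, agent, url="https://example.com/sample-chapter"):
    """Check whether the given rules tell `agent` not to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = build_robots_txt(AI_USER_AGENTS)
    print(rules)
    for agent in AI_USER_AGENTS + ["Mozilla"]:
        print(agent, "blocked" if is_blocked(rules, agent) else "allowed")
```

Keep in mind that robots.txt is a voluntary convention: it only asks crawlers to stay away, and it does nothing about copies that already sit in a dataset such as LibGen.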
Theft or Fair Use?
Not everyone agrees that AI platforms’ use of author content is bad.
Tech journalist Mathew Ingram says AI content scraping could be considered fair use.* He explains his reasoning:
“The case against AI indexing of content is relatively straightforward: by hoovering up content online and then using it to create a massive database for training large-language models, AI engines copy that content without asking and without paying for it (unless the publisher or owner has signed a deal with the AI company, as some news outlets have).
“This pretty clearly qualifies as de facto copyright infringement, as the Authors Guild and the New York Times and a number of others have argued and continue to argue. In a similar way, one could imagine that if a company were to copy millions of books and use them to create a massive index of content, that would pretty clearly qualify as infringement as well—copying without permission or payment.”
However, he adds:
“The major difference between these two cases is that the second hypothetical one actually happened, when Google scanned millions of books as part of its Google Books project between 2002 and 2005, and created an index that allowed users to search for content from those books.
“After years of back-and-forth negotiations over payment for the infringement, this led to a lawsuit in which the Authors Guild and others argued that Google was guilty of copyright infringement on a massive scale.
“In the early days of that case, Judge Denny Chin of the Southern District of New York seemed to agree, but then at some point he changed his mind, and ruled that Google's book-scanning activity was covered by the fair-use exception under US copyright law.”
“How could such massive and obvious unauthorized copying of content owned by someone else be permitted to occur without permission?” Ingram asks. “Because Judge Chin ruled that Google Books was a ‘transformative’ use of the content, seen by many as the crucial factor in deciding whether something qualifies as fair use.”
For more on fair use and copyright laws, read…
Navigating the AI Marketing Copyright Minefield
Marketing is a dynamic field that requires constant adaption to emerging technologies. Paramount among them is generative AI — a game-changing technology that could alter the face of content production and personalized advertising forever.
Conclusion: It’s Complicated
One of Facebook’s relationship statuses reads: “It’s complicated.” I suppose that applies here, too.
It’s a decidedly ethical issue, and I believe the outcome should favor authors regarding notification, permission, and payment.
However, authors don’t always own the rights to their books; publishers do (which is why I’m self-publishing my next book, working title “The AI Technostress Paradox”). In that case, publishers should be up in arms, and perhaps some are.
What’s your take? Do you feel that content scraping of books and other copyrighted works constitutes fair use or is outright piracy?
*Under the fair use doctrine of the U.S. copyright statute, it is permissible to use limited portions of a work, including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports. No legal rule permits the use of a specific number of words, a certain number of musical notes, or a percentage of a work. Whether a particular use qualifies as fair use depends on all the circumstances. (Like I said, it’s complicated.)
I’m still thinking through a lot of this, especially the questions around ownership and what permission even means online. It’s a complicated issue and I can see both sides.
Good coverage of the issue, Paul. My view: taking a person's labor without consent or compensation is THEFT, plain and simple, no matter how some may try to rationalize it. Remember Napster?
https://medium.com/@bairdbrightman/stop-thief-chatgpt-midjourney-ai-etc-6be4a058e098