Ha! Excellent post title.
Looking forward to reading this, Paul. You’re right, there’s a huge gap here that needs to be filled.
Thank you for helping fill that void!
One topic I’m interested in is copyright issues. Some of the datasets the LLMs were trained on are sketchy AF. Have you heard of “the pile”?
Thanks, Jessica. I'm going to do my best. I'll be covering copyright issues soon. Another person also raised that as an issue of concern. "The pile"? That's not a term I've heard. Care to elaborate?
Yeah I saw that comment after I posted mine. Sensing a trend here! :)
Wikipedia article about The Pile: https://en.wikipedia.org/wiki/The_Pile_(dataset)
I feel compelled to call out this line from that Wikipedia page: "Some potential sub-datasets were excluded for various reasons, such as the US Congressional Record, which was excluded due to its racist content.[1]"
This would be funny if it weren't so disturbing. Anyway, I digress. Back to The Pile and copyright issues.
Here's some additional coverage:
https://gizmodo.com/anti-piracy-group-takes-ai-training-dataset-books3-off-1850743763
Here's a fascinating thread from Hacker News:
https://news.ycombinator.com/item?id=25607809
One of the telling comments in that Hacker News thread mentions that the creators of The Pile (specifically the Books3 data subset) just started ignoring authors' requests related to whether and how their books could be used as training data.
You have to scroll down a bit but that part of the conversation is directly relevant.
In a nutshell, the public claims from companies like Meta and OpenAI are that their models were trained on publicly available content online. So nothing behind paywalls, etc. However, that's not entirely true, because some datasets were sourced from the dark web and torrenting sites and included copyrighted works by a host of authors.
A few more relevant articles:
https://www.theverge.com/2023/7/9/23788741/sarah-silverman-openai-meta-chatgpt-llama-copyright-infringement-chatbots-artificial-intelligence-ai
Looks like Meta was one of the biggest culprits here, having used a dataset of pirated books to train its model:
https://interestingengineering.com/innovation/anti-piracy-group-shuts-down-books3-a-popular-dataset-for-ai-models
That last link is probably the place to start.
I would classify this as highly unethical behavior. Though your Substack seems to be focused on a different angle: the ethics of using these tools, vs. the highly questionable ethics of how they were created.
It still seems relevant, though. If you want to use an LLM to write a piece "in the style of Michael Pollan," is it ethical to do so if he did not approve of his works being included in the training dataset?
I would argue no. Not ethical.
My favorite description of LLMs so far was the post I saw on Facebook that referred to them as "Plagiarism Machines."
Thanks for all that content, Jess. I appreciate it. Looks like a deep dive into this topic is called for. Also, thanks for subscribing.
You're welcome! Hopefully it's helpful.
Excellent topic to cover, Paul! I look forward to learning more from you. Copyright issues are a big gray area too. Will you be covering that as well?
Hi, Denise. Thanks for your comment. I will, absolutely. I'm considering going weekly, which would let me cover more topics more quickly, but it's a bandwidth issue. Who's to say how all this ethics stuff is going to play out.