AI training data controversy escalates: Another tech giant faces lawsuit over pirated books

[CryptoWorld] The tech industry is facing yet another lawsuit over AI training data. Author Elizabeth Lyon has sued a well-known tech company, claiming that its large language model was trained on a dataset that included her copyrighted works.

What exactly happened? The dispute centers on the SlimPajama-627B dataset, which is derived from the RedPajama project and incorporates the highly controversial "Books3" collection, a large corpus of unlicensed books. The company used this data to train its SlimLM model, and the author discovered that her works had been swept into the training set without permission.

This is not an isolated case. Similar lawsuits are piling up, targeting not only this company but several other tech giants accused of using protected content without authorization during AI development. They all raise the same core questions: can AI models freely train on data pulled from the internet and from published works, and how can the rights of content creators be protected?

From the perspective of Web3 and open-source communities, the incident reflects a deeper tension: AI development requires vast amounts of data, yet the rights of content creators cannot simply be brushed aside. Striking a balance between the two has become one of the hardest challenges facing the tech industry, and how these lawsuits play out remains to be seen.

Comments
GateUser-beba108d
· 2025-12-18 01:50
Here we go again. Big tech companies just take everything wholesale, copyright or not.
AirdropDreamer
· 2025-12-18 01:50
Here we go again, yet another case of AI stealing data... tech giants really are unstoppable, huh.
MidnightSnapHunter
· 2025-12-18 01:48
Damn, here we go again? Large model training is just a modern version of "utilitarianism."
MetaMaximalist
· 2025-12-18 01:28
honestly this is just the beginning. once the precedent gets set, every creator's gonna come knocking. the real question nobody's asking is whether fair use doctrine even *applies* to training data at scale... and ngl the tech giants banking on murky legal territory while authors get squeezed is peak extractive capitalism dressed up as innovation.