As first reported by Ars Technica, discovery in the copyright case against Facebook parent company Meta over its use of authors’ work to train large language models has unearthed some embarrassing dirty laundry. Dozens of emails, allegedly between Meta employees, discuss torrenting massive amounts of pirated material (and seeding those torrents to boot) in order to train the company’s AI models.
It was revealed via court documents last month that Meta had obtained AI training data from LibGen, a large file sharing database that includes everything from paywalled news and academic articles to whole books. The plaintiffs allege that Meta downloaded over 80 terabytes from LibGen and another so-called “shadow library” by the name of Z-Library. This is, to be clear, internet piracy on a scale that would make a Nintendo lawyer blush, and the lawsuit alleges the emails put in writing “Meta’s decision to take and use copyrighted works without permission that it knew to be pirated, despite clear ethical concerns.”
One of the emails in evidence quotes an alleged Meta employee futilely advising that “using pirated material should be beyond our ethical threshold” before arguing that databases like LibGen “are basically like PirateBay or something like that, they are distributing content that is protected by copyright and they’re infringing it.”
There are repeated examples of emails ascribed to Meta employees flagging the use of LibGen as a concern, either in failed “lone sane man” fashion or in the context of hiding the activity. One researcher proposed only accessing LibGen through a VPN, and later joked that “torrenting from a corporate laptop doesn’t feel right 😂.”
Meta would ultimately operate in “stealth mode,” to quote one AI researcher at the company, concealing the activity by only downloading and seeding the torrents outside official Facebook servers. As an aside: It was real neighborly of them to seed the torrents too! Wonder how good their ratios were.
The plaintiffs further argue that these discovery documents suggest that Meta executives up to and including Mark Zuckerberg were aware of the use of pirated material to train AI models at the company. Another detail that stands out to me: The emails filed as evidence indicate that Meta employees believed OpenAI used LibGen for its own models, framing the company’s use of the database as a sort of arms race.
If the Internet Archive isn’t allowed to loan books as a digital library, I don’t think companies like Meta should be allowed to swallow up terabytes of pirated material to train a chatbot that will lie to you about how many planets are in the solar system. In a twist of fate, our international copyright regime looks to be one of the sturdiest bulwarks against an AI future. I’m no fan of the Digital Millennium Copyright Act, but I say let them fight.
One other thing I just can’t escape is how low-rent this all is: Our Silicon Valley thought leaders and mavericks need unprecedented injections of capital in order to… do internet piracy and conquer a new frontier in cheating on your homework? The sheer body of written communication allegedly confirming it all is just the cherry on top of a schadenfreude sundae. “Subject: Forwarded: Re:Re:Re:Re: Crimes.” I’m reminded of how Valve was saved from ruin by a similar disregard for opsec on the part of its former publisher Vivendi, or, indeed, that one I Think You Should Leave sketch.