Databricks lawsuit over pirated books advances as court rejects bid to dismiss

Wednesday, 29 April 2026 7:11PM UTC

A US court allows authors to proceed with a class action accusing Databricks of using pirated books to develop its large language model, marking a significant turn in AI copyright disputes.

Databricks has failed in its bid to throw out a class action brought by a group of authors who say the company helped build its DBRX large language model using pirated copies of their books. In a ruling last week, US District Judge Charles Breyer in Northern California said the writers had put forward enough to keep the case alive, allowing allegations involving roughly 196,000 titles to move forward.

The dispute centres on a chain of development that runs through MosaicML, the AI start-up Databricks bought in 2023. The plaintiffs claim that MosaicML’s early MPT models, including MPT-7B and MPT-30B, were trained on the books portion of the RedPajama dataset, a copy of Books3, a collection removed from Hugging Face in October 2023 over copyright concerns. Databricks has argued that the authors have not shown DBRX itself was trained on the disputed material, but the plaintiffs say the company copied the books during development regardless of what ended up in the final model.

Judge Breyer said the authors had tied their works directly to DBRX and that employee statements supported the inference that the challenged material mattered to the model’s development. The case now appears likely to turn on how much of the training pipeline Databricks can explain, and whether the court accepts the company’s claim that any use of the material was too remote to amount to infringement.

The potential financial exposure is substantial if the writers can persuade the court that any infringement was wilful. Brandon Butler, a copyright lawyer and executive director of Re:Create, told The Register that statutory damages in copyright law can be severe, with awards reaching six figures per work. Among the authors backing the suit are Jason Reynolds, Stuart O’Nan, Brian Keene and Rebecca Makkai, whose The Great Believers was a Pulitzer Prize finalist.

Databricks has not yet advanced the fair use defence that has helped other AI companies in related litigation. Meta won a similar case brought by authors last year, while Anthropic also prevailed on fair use in a separate matter, though it later agreed to set up a $1.5 billion compensation fund after admitting it had ingested pirated books. For now, Databricks is left to fight on with a narrower procedural argument, while the authors insist the company cannot lawfully download, store and reuse pirated works simply because they were not retained in the final training set.

Source Reference Map

Inspired by headline at: ^[1]

Sources by paragraph:

Paragraph 1: ^[2], ^[6]
Paragraph 2: ^[1], ^[4], ^[5]
Paragraph 3: ^[1], ^[6]
Paragraph 4: ^[1], ^[2], ^[3]
Paragraph 5: ^[1], ^[2], ^[6]

Source: Noah Wire Services

More on this

https://www.theregister.com/2026/04/29/databricks_author_copyright_lawsuit_continues/ - Databricks is facing a class action lawsuit from several authors who allege that their copyrighted books were used without permission to train Databricks' large language model (LLM), DBRX. The authors claim that the model was developed using a database containing pirated versions of approximately 196,000 titles. Judge Charles Breyer of the U.S. District Court in Northern California denied Databricks' motion to dismiss the case, allowing the lawsuit to proceed. The case centres on whether Databricks' DBRX model was trained using the infringing data, with potential damages being significant if the authors can prove willful infringement.
https://letsdatascience.com/news/databricks-faces-authors-copyright-claims-over-dbrx-6b81b8b8 - Databricks is facing a class action lawsuit from authors who allege that their copyrighted works were used without permission to train Databricks' large language model (LLM), DBRX. The plaintiffs, including notable authors, claim that the model was trained using a dataset containing pirated versions of approximately 196,000 titles. The lawsuit survived Databricks' motion to dismiss, allowing the case to proceed. The central issue is whether Databricks' DBRX model was trained using the infringing data, with potential damages being substantial if the authors can prove willful infringement.
https://www.saverilawfirm.com/databricks-inc.-large-language-model-litigation - A class action lawsuit has been filed against Databricks and its subsidiary MosaicML, alleging that their large language models (LLMs) were trained using pirated books without the authors' permission. The plaintiffs claim that MosaicML's MPT models, including MPT-7B and MPT-30B, were trained on a dataset called RedPajama—Books, which is a copy of the Books3 dataset. The Books3 dataset was removed from Hugging Face in October 2023 due to copyright infringement concerns. The lawsuit seeks damages and an injunction to prevent further infringement.
https://evan.law/2025/06/26/court-lets-authors-expand-copyright-case-to-target-databricks-new-ai-models/ - In June 2025, authors expanded their copyright infringement lawsuit against Databricks and MosaicML to include Databricks' new AI models, DBRX. The plaintiffs allege that their copyrighted works were used without permission to train these models. The court allowed the amendment to the complaint, enabling the authors to target Databricks' new AI models in their lawsuit. The case continues to develop, with the central issue being whether Databricks' DBRX models were trained using the plaintiffs' copyrighted data.
https://docs.justia.com/cases/federal/district-courts/california/candce/3%3A2024cv01451/426188/288 - In April 2026, Judge Charles R. Breyer of the U.S. District Court for the Northern District of California denied Databricks' motion to dismiss the copyright infringement claims brought by authors against Databricks and MosaicML. The plaintiffs allege that their copyrighted works were used without permission to train Databricks' large language models, including DBRX. The court's decision allows the lawsuit to proceed, with the central issue being whether Databricks' models were trained using the plaintiffs' copyrighted data.
https://www.mishcon.com/generative-ai-copyright-case-and-policy-tracker - A class action lawsuit has been filed against Databricks and its subsidiary MosaicML, alleging that their large language models (LLMs) were trained using pirated books without the authors' permission. The plaintiffs claim that MosaicML's MPT models, including MPT-7B and MPT-30B, were trained on a dataset called RedPajama—Books, which is a copy of the Books3 dataset. The Books3 dataset was removed from Hugging Face in October 2023 due to copyright infringement concerns. The lawsuit seeks damages and an injunction to prevent further infringement.

Noah Fact Check Pro

The draft above was created using the information available at the time the story first emerged. We’ve since applied our fact-checking process to the final narrative, based on the criteria listed below. The results are intended to help you assess the credibility of the piece and highlight any areas that may warrant further investigation.

Freshness check

Score: 10

Notes: The article is dated April 29, 2026, and reports on a recent legal development, indicating high freshness. No evidence of recycled or outdated content was found.

Quotes check

Score: 8

Notes: The article includes direct quotes from Judge Charles Breyer and Brandon Butler. While the quotes are attributed, they are not independently verifiable through the provided sources. This raises concerns about the ability to confirm the accuracy of these statements.

Source reliability

Score: 7

Notes: The primary source is The Register, a technology news website. While it is a known publication, it is not as widely recognized as major outlets like the BBC or Reuters. The article cites additional sources, including legal filings and statements from Brandon Butler, a copyright lawyer. However, the reliance on a single source for direct quotes and the lack of independent verification of these quotes reduce the overall reliability.

Plausibility check

Score: 9

Notes: The claims about the lawsuit and the involvement of authors like Jason Reynolds and Rebecca Makkai are plausible and align with known information. However, the inability to independently verify direct quotes from Judge Breyer and Brandon Butler introduces some uncertainty.

Overall assessment

Verdict (FAIL, OPEN, PASS): CONDITIONAL

Confidence (LOW, MEDIUM, HIGH): MEDIUM

Summary: The article provides timely and plausible information about the Databricks copyright lawsuit, with direct quotes from Judge Charles Breyer and Brandon Butler. However, the inability to independently verify these quotes and the reliance on a single source for direct quotes reduce the overall reliability. Given these concerns, the content is conditionally acceptable, provided that the direct quotes are independently verified before publication.

AI legal cases
Copyright infringement
Databricks