Databricks has failed in its bid to throw out a class action brought by a group of authors who say the company helped build its DBRX large language model using pirated copies of their books. In a ruling last week, US District Judge Charles Breyer in Northern California said the writers had put forward enough to keep the case alive, allowing allegations involving roughly 196,000 titles to move forward.

The dispute centres on a chain of development that runs through MosaicML, the AI start-up Databricks bought in 2023. According to the materials described in the case, early Mosaic models were linked to the RedPajama dataset, which incorporated Books3, a collection later removed from Hugging Face over copyright concerns. Databricks has argued that the authors have not shown DBRX itself was trained on the disputed material, but the plaintiffs say the company copied the books during development regardless of what ended up in the final model.

Judge Breyer said the authors had tied their works directly to DBRX and that employee statements supported the inference that the challenged material mattered to the model’s development. The case now appears likely to turn on how much of the training pipeline Databricks can explain, and whether the court accepts the company’s claim that any use of the material was too remote to amount to infringement.

The potential financial exposure is substantial if the writers can persuade the court that any infringement was wilful. Brandon Butler, a copyright lawyer and executive director of Re:Create, told The Register that statutory damages in copyright law can be severe, with awards reaching six figures per work. Among the authors backing the suit are Jason Reynolds, Stuart O’Nan, Brian Keene and Rebecca Makkai, whose The Great Believers was a Pulitzer Prize finalist.

Databricks has not yet advanced the fair use defence that has helped other AI companies in related litigation. Meta won a similar case brought by authors last year, while Anthropic also prevailed on fair use in a separate matter, though it later agreed to set up a $1.5 billion compensation fund after admitting it had ingested pirated books. For now, Databricks is left to fight on with a narrower procedural argument, while the authors insist the company cannot lawfully download, store and reuse pirated works simply because they were not retained in the final training set.

Source: Noah Wire Services