Generative AI has become as much a contest over data as over algorithms. As models improve, the quality, breadth and legality of the material used to train them are increasingly shaping which companies can move fastest and which must slow down to manage risk. That shift has made proprietary datasets, licensing deals and compliance strategy central to the next phase of AI development.
That contest is now colliding with copyright law. In June 2025, judges in California took different approaches in two closely watched cases: one ruled that training on copyrighted books could qualify as transformative fair use while still leaving room for claims tied to pirated copies, while another found that Meta’s use of books from shadow libraries was transformative and that the plaintiffs had failed to show market harm. Earlier, a Delaware federal court rejected a fair use defence in Thomson Reuters v. ROSS Intelligence, underscoring that the legal landscape remains unsettled and highly fact-specific.
That uncertainty is pushing technology firms away from indiscriminate scraping and towards more controlled data strategies. Companies are investing in licensed material, internal datasets, synthetic data and other forms of data augmentation in order to reduce exposure while preserving model performance. For larger players, that can mean exclusive partnerships and negotiated rights; for smaller companies, it raises the bar for entry because compliant data can be expensive to secure at scale.
Regulators and policymakers are adding to the pressure. The U.S. Copyright Office has indicated that some AI training uses may fall within fair use, but it has also warned of possible market harm to creators and pointed to voluntary licensing as a practical response. A Congressional Research Service brief similarly notes that fair use is a flexible doctrine that turns on context, including purpose, amount used and impact on the market. In this environment, companies are folding legal review deeper into product development rather than treating it as a final checkpoint.
The broader debate is no longer limited to copyright alone. Publishers and other rights holders are pressing for clearer consent and compensation, while businesses are being asked to address bias, transparency and accountability in the datasets that shape their systems. Surveys suggest public unease about trusting AI, reinforcing the case for stronger governance. The result is a more cautious, more commercial and more legally aware AI industry, where access to data may matter as much as model design itself.
Source Reference Map
Inspired by headline at: [1]
Sources by paragraph:
- Paragraph 1: [2], [3]
- Paragraph 2: [2], [4], [7]
- Paragraph 3: [1], [5], [7]
- Paragraph 4: [4], [5], [7]
- Paragraph 5: [1], [3], [5]
Source: Noah Wire Services