Large Language Models and the Fair Use Doctrine

Author: Multi-Agent AI Assistant
Date: 07/19/2025

Abstract

The rapid evolution of artificial intelligence, particularly large language models (LLMs), has brought new complexity to the interpretation of the fair use doctrine under United States copyright law. This article examines recent judicial decisions, scholarly perspectives, and ongoing policy debates concerning whether and to what extent the use of copyrighted works to train LLMs constitutes fair use. The analysis centers on the key legal factors, the transformative nature of LLM training, the distinction between legitimate and pirated sources, and the implications for authors, technology developers, and the broader research community.

Introduction

The emergence of large language models (LLMs) and generative artificial intelligence (Gen-AI) has fundamentally altered the landscape of information creation and dissemination. These models are frequently trained on massive datasets that include works protected by copyright. As a result, courts, policymakers, and stakeholders are grappling with the question: Does the use of copyrighted content for AI training fall within the fair use exception, or does it constitute infringement? Recent case law and commentary have begun to define, but not conclusively resolve, this question.

Recent Legal Developments

Key Judicial Decisions

Two pivotal decisions from the Northern District of California—Bartz v. Anthropic PBC (June 23, 2025) and Kadrey v. Meta Platforms, Inc. (June 25, 2025)—have provided significant but nuanced guidance on the fair use doctrine as applied to LLM training (Skadden, 2025; Husch Blackwell, 2025; Norton Rose Fulbright, 2025).

Bartz v. Anthropic PBC

In Bartz, authors alleged that Anthropic PBC infringed their copyrights by using both legitimately purchased and pirated copies of their books to train its LLM. Judge Alsup held that using copyrighted works for LLM training was "quintessentially transformative" and thus supported a finding of fair use. However, the decision carefully distinguished between:

Legitimately purchased works: Digitizing purchased works for LLM training was deemed fair use.
Pirated works: Incorporating pirated works into a permanent library was not fair use, as it displaced demand for legitimate copies and did not transform the original works in a legally significant way.

Kadrey v. Meta Platforms, Inc.

In Kadrey, a group of authors sued Meta for using their works, which Meta had downloaded from online "shadow libraries" without permission, to train its LLMs. Judge Chhabria agreed that the use was transformative but expressed concern about possible market harm if LLMs can generate works that compete with those of the original authors. However, because the plaintiffs offered no concrete evidence that the LLM outputs actually harmed the market for the original works, the court ultimately found that fair use applied on the record before it.

Common Themes and Divergences

Fact-Specific Analysis:
Both courts stressed that fair use determinations are highly fact-dependent. As district court rulings, the holdings do not establish broad, binding precedent, and future cases could be decided differently given stronger evidence of harm or different fact patterns (Skadden, 2025).

Transformative Use:
The transformative nature of LLM training—using works to teach an AI to understand language rather than to merely reproduce or replace the works—strongly favors fair use under §107 of the Copyright Act (Norton Rose Fulbright, 2025).

Market Impact:
A lack of evidence regarding market harm or infringing outputs was decisive in both cases. The courts suggested that future plaintiffs with direct evidence of economic harm or substantial copying in AI-generated outputs might prevail (Skadden, 2025; Husch Blackwell, 2025).

Pirated vs. Legitimate Sources:
The Bartz court drew a sharp line between legitimately acquired materials and pirated content: fair use did not protect building a permanent library from pirated works, but it did cover digitizing legitimately purchased works and using them for training (Norton Rose Fulbright, 2025). Kadrey did not turn on that line; the court found fair use on the record before it even though the works had been obtained without permission.

The Four-Factor Fair Use Test in the LLM Context

The fair use analysis, codified in 17 U.S.C. §107, considers four factors:

Purpose and Character of the Use:
Both courts found LLM training to be highly transformative, as it repurposes works for machine learning rather than for traditional reading or consumption by humans. The transformative character weighed heavily in favor of fair use.

Nature of the Copyrighted Work:
The works at issue (books, literary works) are highly expressive, which generally weighs against fair use. However, this factor was not dispositive given the transformative purpose (Skadden, 2025).

Amount and Substantiality Used:
While LLM training typically involves copying entire works, both courts accepted that this was reasonably necessary for the transformative purpose of training a model, as opposed to reproducing or distributing the original works (Husch Blackwell, 2025).

Effect on the Market:
The courts required concrete evidence of market harm, either through outputs that compete with the original works or through displacement of sales. In Bartz, the use of pirated works to build a permanent library was found to harm the market directly by substituting for purchases. Where no such evidence was presented, the market impact factor did not weigh against fair use.

Precedent and Scholarly Perspectives

Legal scholars and advocacy groups have pointed to earlier cases such as Authors Guild v. HathiTrust and Authors Guild v. Google as supporting the fair use defense for AI training (Association of Research Libraries, 2024). These cases held that mass digitization for non-expressive, analytical purposes was fair use, provided that no meaningful amounts of the original works were made available to the public.

The Library Copyright Alliance (LCA) and the Association of Research Libraries argue that applying fair use to LLM training is essential for enabling research, education, and access to information. In their view, restricting LLM training to public domain works would undermine the scope and utility of AI for contemporary research and cultural analysis (Association of Research Libraries, 2024).

The Distinction Between Inputs and Outputs

A critical distinction in the law and policy debates is between:

Input: The use of copyrighted works to train an LLM (generally considered transformative and fair use, absent market harm or direct substitution).
Output: The texts or content generated by the LLM. If outputs reproduce substantial portions of copyrighted works, they may be infringing even if the training itself was fair use.

Both the courts and commentators recognize that liability for infringement may ultimately hinge less on how AI is trained and more on how its outputs are used and distributed (Skadden, 2025; Association of Research Libraries, 2024).

Market Harm: Direct and Indirect Substitution

A contentious issue is whether LLMs indirectly harm the market for original works by enabling the rapid, automated creation of derivative or substitutive content. Judge Chhabria in Kadrey suggested that, in future cases, evidence of such indirect substitution could tip the balance against fair use. However, absent evidence, courts have so far declined to find market harm on this basis.

Both judges rejected the argument that the potential to license works for AI training constitutes market harm, calling this reasoning circular because it presupposes that licensing is required when the use may be fair (Skadden, 2025; Norton Rose Fulbright, 2025).

Policy Implications

Support for Fair Use in AI Training:
Research, educational, and library communities argue that requiring licenses for all copyrighted materials used in AI training would stifle innovation, restrict research, and limit the representativeness of AI models. The Kadrey court, however, was not persuaded that adverse rulings would "thwart innovation," suggesting instead that companies can compensate rights holders if necessary (Skadden, 2025; Association of Research Libraries, 2024).

Concerns from Rights Holders:
Authors and publishers express concern over uncompensated use of their works and the potential for AI to erode the economic value of original creation by flooding the market with derivative content. The courts' focus on evidence-based harm means these concerns may be addressed in future litigation where stronger records are presented.

Conclusion

The application of the fair use doctrine to LLM training is a rapidly evolving and fact-specific area of law. Recent decisions have generally found that the transformative use of copyrighted works for LLM training is fair use, so long as there is no evidence of infringing outputs or market harm. How the training data was acquired can also matter: Bartz declined to extend fair use to a permanent library built from pirated copies. These holdings are narrow, however, and future cases with different facts or stronger evidence could yield different results. The debate continues as stakeholders seek to balance innovation, research, and the rights of creators in the age of artificial intelligence.

References

Association of Research Libraries. (2024, January 23). Training generative AI models on copyrighted works is fair use. https://www.arl.org/blog/training-generative-ai-models-on-copyrighted-works-is-fair-use/

Husch Blackwell. (2025, July 2). Recent decisions clarify fair use doctrine in AI context. https://www.huschblackwell.com/newsandinsights/recent-decisions-clarify-fair-use-doctrine-in-ai-context

Norton Rose Fulbright. (2025, July 8). Two US decisions find that reproducing works to train large language models is fair use – Part 1: Bartz v Anthropic. https://www.nortonrosefulbright.com/en-us/knowledge/publications/4a4e1a04/two-us-decisions-find-that-reproducing-works-to-train-large-language-models-is-fair-use-part-1-bartz-v-anthropic

Skadden, Arps, Slate, Meagher & Flom LLP. (2025, July 8). Fair use and AI training: Two recent decisions highlight the complexity of this issue. https://www.skadden.com/insights/publications/2025/07/fair-use-and-ai-training-two-recent-decisions

This article was prepared in accordance with the APA Style Guide (7th ed.), as of July 19, 2025.
