The AI Act template for training data is finally here—and it’s about to reshape how disputes between copyright holders and AI developers unfold in the EU.
Below, we delve into what the new disclosure template means, why it’s stirring controversy, and how it could reshape the entire AI landscape in Europe and beyond. You can listen to the podcast episode on the topic on Apple Podcasts, Google Podcasts, Spotify, and Audible. This article was updated on 24 July 2025 after the publication of the final version of the EU copyright disclosure template under the AI Act.
On 24 July 2025, the European Commission released the long-anticipated template that general-purpose AI (GPAI) model providers must use to disclose the data used to train their models. While its name might suggest another bureaucratic compliance form, this AI Act training template is likely to have profound implications for copyright enforcement, transparency, and future litigation strategy.
What Is the AI Act Template for Training?
Under Article 53(1)(d) of the AI Act, providers of a general-purpose AI model available in the EU must publicly share a summary of the training data used. This includes data sourced from:
- Public datasets (like Common Crawl);
- Websites scraped by the provider or its agents;
- User interactions (e.g., prompts or clicks);
- Synthetic content generated by other AI systems; and
- Licensed datasets or private archives.
The template breaks down training content into multiple sections and requires providers to list (among other things) the top 10% of scraped domains, which can include news outlets, blogs, image repositories, and other content-rich sites. The goal? To enable copyright holders and other stakeholders to verify the proper usage of their content.
A critical ambiguity arises here: What does “top 10% of all domain names determined by the size of the content scraped” actually mean? Is this 10% calculated per modality—text, audio, video—or across all modalities collectively? The answer materially affects the visibility of text-based content versus heavier formats like video or high-resolution images, which could otherwise dominate the list due to sheer file size. Moreover, even the metric for “size” is unclear: is it raw megabytes or the number of extractable tokens? The latter would better reflect the training influence of a dataset—particularly for text—while avoiding distortions caused by file format.
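To see why the metric matters, consider a toy corpus (all domain names and figures below are invented for illustration): ranking by raw bytes surfaces media-heavy sites, while ranking by extractable tokens surfaces the text sources that actually shape a language model.

```python
# Hypothetical corpus statistics — every domain and number is invented.
corpus = {
    "news.example":  {"bytes": 2_000_000_000,  "tokens": 450_000_000},  # text-heavy
    "video.example": {"bytes": 90_000_000_000, "tokens": 1_000_000},    # huge files, little text
    "blog.example":  {"bytes": 500_000_000,    "tokens": 120_000_000},
}

def rank(metric: str) -> list[str]:
    """Order domains by the chosen 'size' metric, largest first."""
    return sorted(corpus, key=lambda domain: corpus[domain][metric], reverse=True)

# The choice of metric flips which domain tops the disclosure list:
print(rank("bytes"))   # video.example ranks first by raw storage size
print(rank("tokens"))  # news.example ranks first by training-relevant text
```

Which reading the Commission intends will therefore determine whether a newspaper’s site appears in a provider’s top-10% list at all.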
How the Template Could Fuel Copyright Litigation
The core innovation of this AI Act training template lies in its potential to facilitate copyright disputes. Here’s how:
1. A New Layer of Evidence
Before this template, it was nearly impossible for copyright holders to know whether their works had been used to train AI models. Now, if a model’s summary lists specific domains, that’s a clear lead for rights holders to investigate.
2. A Stronger Legal Argument
While the summary won’t list individual copyrighted works, it will show where the data came from—and that may be enough. In EU courts, circumstantial evidence (especially when coupled with TDM opt-out notices) can help build a case of unauthorized use.
3. The Basis for Access Requests
The AI Act encourages AI developers to respond “in good faith” to requests from rights holders regarding specific domain use. That could form a soft-law practice—eventually giving rights holders broader access to training data documentation.
The template could indeed become a powerful tool in copyright enforcement. It mandates disclosure of crawler behavior, including whether the provider respects TDM opt-outs and how it handles restricted content like paywalled sites. This could open the door to liability where such protections are disregarded.
Where the Copyright Template Leaves Gray Areas
While the AI Act copyright training template is a major transparency step, it also contains several legal ambiguities that could shape future disputes:
1. No Requirement to List Individual Copyrighted Works
Providers must summarize domains and datasets, but they’re not obligated to list specific works. This means a poet, photographer, or journalist still won’t know for sure if their creation was used unless it was clearly traceable to a disclosed source.
Will courts interpret domain-level disclosure as presumptive use of all content hosted there?
2. Trade Secret Protections May Limit Transparency
The European Commission tries to balance openness with the protection of trade secrets. Providers may use general descriptions for private datasets and justify redactions. But how will this affect copyright enforcement?
This balance may be tested in court. Some AI developers could challenge the template itself, arguing that it compels disclosure of sensitive business information in violation of the AI Act or beyond what the law intended by “narrative” summary. While the Commission claims the design respects confidential information, the business community may not be so convinced.
Can a provider refuse to disclose full source details citing trade secrets—if that obstructs a rights holder’s legal remedy?
3. Voluntary vs. Mandatory Disclosure
Many sections of the template allow optional, narrative-level disclosure. Providers may choose to omit deeper technical information. Rights holders may argue in future proceedings that selective disclosures reflect an intent to obscure unlawful training practices.
Will silence in optional sections be used as evidence of non-compliance or copyright infringement?
Strategic Moves for Copyright Holders
If your company is a rights holder, the publication of this training template under the AI Act is your opportunity to act.
Here’s what to do next:
- Audit Published Summaries
Monitor whether your content sources (or your clients’ platforms) appear on domain lists. From 2 August 2025, every GPAI model placed on the market from that date must include a public training data summary; for GPAI models already on the market, the obligation becomes applicable from 2 August 2027. A new summary may also become due because of changes implemented to an existing GPAI model.
- Prepare Targeted Requests
Use the template to request further data from providers. The European Commission encourages voluntary cooperation for domain-specific questions—particularly if you assert legitimate copyright claims.
- Document Your TDM Opt-Outs
Under EU copyright law, rights holders can block automated training use via the EU Text and Data Mining (TDM) opt-out (e.g., robots.txt, metadata). The template asks providers to explain how they respected such notices—any omission could become a legal vulnerability.
- Get Ready for Litigation
With summaries published, claims under the IP Enforcement Directive (Directive 2004/48/EC) become more plausible. Rights holders will be able to argue that model providers knew or should have known that content was protected.
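As a practical illustration of the opt-out step above, a site operator can express a machine-readable reservation in robots.txt. The crawler names below are real AI/data crawlers, but the ruleset is only a sketch, not legal or technical advice; the TDM opt-out can also be expressed via metadata, and whether a given provider honors any of these signals is exactly what its summary should disclose.

```text
# robots.txt — illustrative TDM reservation (ruleset is a sketch)

User-agent: GPTBot            # OpenAI's crawler
Disallow: /

User-agent: CCBot             # Common Crawl's crawler
Disallow: /

User-agent: Google-Extended   # Google's AI-training control token
Disallow: /
```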
Strategic Implications for AI Providers
If you’re developing a GPAI model, this AI Act training and copyright template is not just a box to check—it’s a legal tool that can protect or expose your business.
Consider this:
- Be Transparent—but Strategic
Disclose enough to show compliance, but avoid unnecessary risks. If a dataset isn’t essential to name (because it’s private and not legally required), justify the omission clearly.
- Respond to Rights Holders
Engaging proactively with content owners shows good faith. Ignoring requests—even voluntary ones—could escalate quickly to legal threats.
- Update Your Policies
The AI Act mandates a copyright policy. Align your practices with opt-out compliance under the TDM exception, and document your efforts to remove illegal or unauthorized content.
The Commission also made clear that the template will apply to downstream developers who modify a GPAI model substantially (e.g. over one-third of its structure). This means even companies fine-tuning or adapting open-source models could become liable for new summaries—and that creates ripple effects for compliance teams across the value chain.
How Can AI Providers Challenge Copyright Claims Relying on the TDM Exception?
One of the few legal defenses available to AI developers facing copyright claims in the EU is the Text and Data Mining (TDM) exception under Article 4 of the Copyright in the Digital Single Market Directive (Directive (EU) 2019/790). This exception allows the reproduction and extraction of lawfully accessible works for the purposes of data mining, including for training AI systems—unless the rightsholder has explicitly reserved their rights.
1. Legal Basis for TDM in AI Training
Under EU law:
- Article 3 of the Directive provides a mandatory TDM exception for research organizations and cultural heritage institutions;
- Article 4 provides a broader exception that applies to any user—including commercial AI providers—unless the rights holder has explicitly opted out.
Therefore, AI providers may lawfully use copyrighted material for training if:
- They accessed the data lawfully (e.g., scraping public websites not behind a paywall);
- The data was not subject to a TDM opt-out, clearly expressed through metadata or robots.txt exclusions;
- The material was used strictly for text and data mining purposes and not republished or otherwise exploited commercially beyond model training.
2. Leveraging the TDM Exception in Disputes
In the context of copyright enforcement following the publication of a model’s training summary, AI providers can raise the TDM exception as a defense, arguing that:
- The data was accessed legally and used strictly for machine learning purposes;
- The website or database in question did not include a valid opt-out;
- The provider honored opt-outs where technically feasible and in line with best practices;
- Their documentation (as now published via the AI Act training and copyright template) demonstrates good-faith compliance.
3. Practical Recommendations for AI Providers
To rely on this defense effectively, AI developers should:
- Maintain logs of crawler activity and respect of robots.txt files or HTTP headers expressing TDM opt-outs;
- Implement copyright policies aligned with Article 53(1)(c) of the AI Act, explaining how opt-outs are detected and enforced;
- Disclose crawler behavior in the template, including treatment of paywalled or opt-out-protected content;
- Avoid training on obviously protected works (e.g., behind paywalls or marked as licensed for specific use only) without explicit licenses;
- Retain and process content only for the duration necessary for AI training;
- Record compliance efforts as part of risk mitigation and audit trails.
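The first and last of the recommendations above can be operationalized in a crawler pipeline: before fetching a page, check the site’s robots.txt and record the decision. The sketch below uses Python’s standard urllib.robotparser; the bot name ExampleTrainingBot and the robots.txt rules are hypothetical.

```python
from urllib import robotparser

# Hypothetical robots.txt served by a publisher (rules invented for illustration).
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits `user_agent` to fetch `url`."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Record every decision so good-faith compliance can be evidenced later.
decisions = {
    agent: may_fetch(ROBOTS_TXT, agent, "https://publisher.example/article")
    for agent in ("ExampleTrainingBot", "OtherBot")
}
# The named training bot is blocked; agents matching "*" remain allowed.
```

In production the robots.txt would be fetched per host (RobotFileParser.set_url followed by read), and each decision written to a timestamped audit log as part of the provider’s compliance trail.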
Limitations of the TDM Exception
While powerful, the TDM exception is not absolute. It cannot override:
- Effective opt-outs, such as metadata and robots.txt instructions;
- Terms of service that explicitly prohibit data mining;
- Data protection laws, if personal data was used unlawfully in training.
Moreover, the burden of proof often shifts to the AI provider once a rights holder makes a substantiated claim—especially under Article 8 of the Enforcement Directive, which grants courts the ability to order detailed disclosures.
Is the AI Act Template the Start of a Copyright Revolution?
The AI Act template for training data is not just about compliance—it is the first regulatory bridge between the world of generative AI and traditional copyright law. It gives creators a tool to assert their rights and challenges AI companies to justify their practices.
One unresolved question is whether the template itself could face legal scrutiny. As a mandatory tool, it could be challenged in court by AI providers that believe it goes beyond the scope of the AI Act or infringes on trade secret protections. If such a case were to succeed, it would dramatically alter the enforcement of Article 53(1)(d).
Will this lead to more lawsuits? Almost certainly. But it may also drive a new licensing ecosystem, where creators, platforms, and model developers negotiate access rather than fight it out in court.
The open question remains: Can transparency alone create fairness—or do we need stronger enforcement mechanisms and global alignment?
This story is just beginning.
You can read several articles on the most relevant issues of artificial intelligence HERE.