Can generative artificial intelligence systems’ use of data, images and content for their own training rely on the new dedicated text and data mining (TDM) exception introduced by the Copyright Directive?
Generative AI systems “self-train” using machine learning algorithms that analyze huge amounts of data, images and content and learn to use that information to create new content similar to existing content.
Such analysis, however, could be seen as reproducing, even if only temporarily, the data and sources used, including any protected works or entire portions of the databases employed. The automated extraction of such content may therefore raise problems of coordination with the rules protecting copyright and related rights, in particular the exclusive right of reproduction under the EU Copyright Directive, implemented in Italy through Law No. 633/1941. Such use could also conflict with the right of the maker of a database to prohibit the extraction or reuse of all or a substantial part of it.
In the context of copyright law, legal scholars have questioned whether protected information and works can be creatively elaborated. On this point, the European legislator has already taken a position: when data are processed, the absence of authorization from the author of the work from which they are extracted may constitute copyright infringement. However, making the extraction of data and content conditional on prior authorization from the holders of the copyrights involved would entail high transaction costs and timeframes incompatible with the development of artificial intelligence systems. For precisely these reasons, the European legislator reformed the matter by introducing certain exceptions and limitations to copyright that are mandatory for each Member State.
Specifically, in the area of data mining, the Copyright Directive 2019/790/EU introduced the so-called text and data mining (TDM) exceptions, regulated in Articles 3 (Text and data mining for scientific research purposes) and 4 (Exceptions or limitations for the purposes of text and data mining). TDM is defined in Article 2 of the Copyright Directive as “any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations.” At the national level, Italy has transposed these articles by introducing two provisions into the Italian Copyright Law: Article 70-ter, which deals only with extraction for scientific purposes by research organizations and cultural heritage institutions, and Article 70-quater, which allows the extraction of text and data in general, by anyone, even for mere profit.
Given the large amounts of data used by AI systems to generate new content, the close relationship between generative AI and the TDM exception is evident: the text and data mining exception allows AI systems to access large amounts of data, which are used by generative AI to create new content. Should these systems not be allowed to access such data, their ability to generate content would undoubtedly be limited.
Of the two TDM exceptions introduced by the European legislator, the second one, which also allows mining for profit, deserves particular attention. It exempts any text and data mining activity carried out on protected works, including software and databases protected by a related right, regardless of the purpose of the activity or the status of the person carrying it out.
This applies, however, provided that:
- such person has had lawful access to the content for the purpose of text and data extraction; and
- the owner of the copyright and related rights and/or the database owner has not expressly reserved the extraction of text and data (so-called opt-out mechanism), thereby keeping TDM activities under their exclusive control.
How far the exception actually liberalizes TDM, however, depends on the manner in which rights holders make the reservation. Article 4(3) of the Copyright Directive itself requires that the reservation be made “in an appropriate manner, such as machine-readable means in the case of content made publicly available online.” This provision thus seems to require that the reservation statement be machine-readable when the work to which it refers is made available to the public on the Internet. An opt-out can also result from the inclusion of an appropriate clause in a contract, a reading confirmed by the Copyright Directive itself, which does not include Article 4 among the rules that contracts cannot override.
Moreover, whether a statement qualifies as a reservation does not depend on whether technical mechanisms are in place to prevent data extraction. This interpretation is based on the merely informative function of the reservation. It is therefore sufficient to state the reservation on the website itself, even if the site lacks protective measures.
Therefore, the reservation:
- may be a “digital” declaration without any technical protection mechanism, such as the exclusion protocols contained in robots.txt files; or
- may be made by applying a digital rights management system that not only provides technical protection but also incorporates an automatically detectable declaration; but
- may not consist merely of technical protection measures that contain no declaration, since these are only tacit expressions of intent. Applying technical measures therefore does not make every TDM activity unlawful per se; it does, however, prohibit extractions that are incompatible with the measure adopted, since Article 174-ter of the Italian Copyright Law prohibits circumventing technological protection measures.
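To illustrate what a machine-readable declaration of the robots.txt kind looks like from the side of a data-mining tool, the following is a minimal sketch in Python using the standard library's `urllib.robotparser`. The robots.txt content, the crawler name `ExampleAIBot` and the helper `is_reserved` are hypothetical and chosen only for illustration. Whether a given robots.txt rule amounts to a valid reservation under Article 4(3) remains a legal question that this sketch does not answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. "ExampleAIBot" is an invented crawler
# name, not a real product.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_reserved(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt tells this user agent not to fetch the URL.

    Assumption: a Disallow rule is treated here as a signal that the
    site operator does not want the content mined by that agent.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(is_reserved(ROBOTS_TXT, "ExampleAIBot", page))
    print(is_reserved(ROBOTS_TXT, "SomeOtherBot", page))
```

The check is purely informational: it reads the declaration but does not circumvent or test any technical protection measure, which mirrors the distinction drawn above between a declaration and a protective mechanism.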
A further problematic issue concerns the retention of copies after data mining has concluded. The provision states that reproductions and extractions “may be retained for as long as is necessary for the purposes of text and data mining,” because a copy stops serving the extraction of text or data once the extraction is complete. Copies therefore may not be retained for purposes beyond TDM, such as verifying and demonstrating the results achieved. Some scholars, however, argue that copies made for data mining can also be retained for as long as necessary to train artificial intelligence systems. In this respect, it would need to be assessed case by case whether AI training itself constitutes text and data mining or is instead an activity subsequent to it. Only in the former case could copies be retained during the AI training phase.
The provision, however, does not regulate the reproductions and further uses needed to exploit the text and data extracted through computational analysis, that is, the use that AI systems could potentially make of them. On this point, some scholars have noted that use of the results of data mining could be conditional on the permission of the owner of the rights in the analyzed content. When data mining extracts only the form of a work, or a portion of it, it must be examined whether the extracted and reused fragments are independently creative and therefore protected. On this question, some take the view that the use of creative fragments does not interfere with copyright when the original meaning given to them by the author can no longer be understood, for example because the fragments are unrecognizable in the new context.
Therefore, developers who intend to use copyrighted works to train a generative AI system will need to follow three steps:
- obtain legitimate access to the data;
- verify that the rights holders have not reserved the right to make reproductions for TDM purposes;
- keep the copies made only as long as necessary for TDM purposes.
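The three steps above can be sketched as a simple pre-training check. This is only an illustrative sketch in Python: the `Source` and `WorkingCopy` records and the functions `may_mine` and `may_retain` are hypothetical names, and the boolean fields stand in for legal assessments that in practice require case-by-case review.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical record for one content source considered for TDM."""
    name: str
    lawful_access: bool  # step 1: access to the content was obtained legitimately
    tdm_reserved: bool   # step 2: the rights holder has expressly opted out


@dataclass
class WorkingCopy:
    """Hypothetical record for a copy made while mining a source."""
    source: Source
    tdm_finished: bool   # step 3: the mining purpose has been fulfilled


def may_mine(source: Source) -> bool:
    """Steps 1 and 2: lawful access and no reservation by the rights holder."""
    return source.lawful_access and not source.tdm_reserved


def may_retain(copy: WorkingCopy) -> bool:
    """Step 3: keep a copy only while it is still needed for TDM."""
    return may_mine(copy.source) and not copy.tdm_finished


if __name__ == "__main__":
    open_site = Source("open-site", lawful_access=True, tdm_reserved=False)
    opted_out = Source("opted-out-site", lawful_access=True, tdm_reserved=True)

    print(may_mine(open_site))
    print(may_mine(opted_out))
    print(may_retain(WorkingCopy(open_site, tdm_finished=True)))
```

Whether `tdm_finished` should become true before or after model training depends on the open question of whether AI training itself counts as text and data mining, so the sketch deliberately leaves that decision to the caller.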
Future case law will need to be monitored to understand how these requirements will be applied in practice.
On a similar topic, the following article may be of interest: “Unlocking the Potential of Generative Artificial Intelligence (AI): Navigating the Legal Issues and Unleashing Its Creativity”.
Authors: Carolina Battistella and Elena Varese from DLA Piper