French media's recent moves to block web crawling by artificial intelligence systems like ChatGPT raise questions about the practice's legal compliance.
Several French media outlets have recently decided to block "GPTBot," the web crawler OpenAI uses to train generative AI systems such as GPT-4, which powers ChatGPT, from accessing their websites. As artificial intelligence systems become more widespread and are used in increasingly diverse areas, data collection indeed takes on a very different weight for newspapers: some decide to stop it, others to exploit the opportunity.
The case between OpenAI and the French media
OpenAI had long stated that it was using GPTBot to fuel the training of upcoming versions of its generative artificial intelligence systems, particularly GPT-5, which will thus be able to build on much broader knowledge. A crawler is software that automatically reads the contents of a web page or database, makes a copy of the documents it finds, and sorts them into an index for ease of later use.
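In simplified terms, the core of such a crawler might be sketched as follows (a minimal illustration in Python using only the standard library; the page content is inlined rather than fetched over the network, and all names are hypothetical, not OpenAI's actual implementation):

```python
from html.parser import HTMLParser

class PageIndexer(HTMLParser):
    """Minimal crawler core: collects outgoing links and visible text."""
    def __init__(self):
        super().__init__()
        self.links = []   # URLs a full crawler would visit next
        self.text = []    # content a full crawler would store in its index

    def handle_starttag(self, tag, attrs):
        # Record every hyperlink target found on the page.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Record non-empty text content for indexing.
        if data.strip():
            self.text.append(data.strip())

# A real crawler would fetch each page over HTTP; here the HTML is inlined.
page = '<html><body><h1>News</h1><p>Article text.</p><a href="/next">More</a></body></html>'
indexer = PageIndexer()
indexer.feed(page)
# indexer.links now holds ["/next"]; indexer.text holds the page's visible text.
```

A production crawler adds fetching, deduplication, politeness delays, and robots.txt handling on top of this basic parse-and-index loop.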
Since the release of the well-known ChatGPT system, OpenAI has disclosed that most of the data used to train the model came from the Internet, with content coverage extending to September 2021.
Issues relating to data quality for Generative AI systems
The issues associated with this data collection for GPTs relate primarily to the quality of the data collected and analyzed. Poor data quality is a phenomenon that has grown with the spread of Big Data and has long been an obstacle to the healthy development of AI systems. For example, data collected from social platforms is evidently of lower quality than data collected from articles published by newspapers, which are much more curated and possess higher value and quality.
The collection of data on the Internet also raises questions for supervisory and regulatory authorities. Data protection authorities have recently raised privacy concerns about the collection of data on social media and other public websites. Information publicly accessible on the Internet remains subject to data protection laws in any case. This type of practice exposes users to risks such as cyber attacks, identity theft, unauthorized surveillance, and unwanted marketing.
The reaction to OpenAI’s web crawling
Faced with this indiscriminate collection of data, media outlets such as Radio France and TF1 blocked ChatGPT’s web crawler from their sites and subsequently proposed an agreement to OpenAI that would guarantee them compensation. Other media outlets around the world, such as The New York Times and CNN, also disabled GPTBot, seeking to protect their content against copyright infringement and, above all, to prevent other companies from benefiting, through OpenAI’s products, from the intellectual work done by newspapers.
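The mechanism these outlets relied on is the Robots Exclusion Protocol: OpenAI publishes the `GPTBot` user-agent string and states that the crawler respects robots.txt directives, so a site can opt out with an entry such as the following (an illustrative fragment, not any particular outlet's actual file):

```
User-agent: GPTBot
Disallow: /
```

The directive only works against crawlers that voluntarily honor robots.txt; it is a declared reservation, not a technical barrier.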
The American Journalism Project, a major U.S. philanthropic organization that aims to rebuild and sustain local news, has entered into an agreement with OpenAI to experiment with ways in which AI can support the news sector. The purpose of this partnership is to improve local newsrooms, since, with the use of AI, newspapers could expand their capabilities.
Why web crawling by artificial intelligence systems might be legitimate
One of the main criticisms raised against the current version of the EU AI Act is its lack of coordination with copyright and data protection laws.
When discussing potential copyright violations by artificial intelligence systems, a key consideration is the applicability of the text and data mining (TDM) exception outlined in the Copyright Directive 2019/790/EU. This exception allows TDM activities on protected works such as software or databases, irrespective of the purpose or of who conducts them, provided that:
- The person has lawful access for TDM purposes.
- The copyright owner hasn’t explicitly reserved rights against TDM activities (an opt-out mechanism); where rights are so reserved, TDM activities revert to the rights holder’s exclusive control.
However, the extent of this opt-out mechanism depends on how the rights holder expresses the reservation. Article 4(3) of the Copyright Directive mandates that, for content made publicly available online, the reservation must be machine-readable. Opting out can also be achieved by incorporating a clause in a contract, a possibility the Directive itself confirms.
Additionally, the nature of the reservation is unrelated to the existence of mechanisms preventing data extraction: the reservation serves only an informative purpose. Hence, even in the absence of protective measures, a reservation published on the website is enough.
The reservation can:
- Be a digital statement without IT protection mechanisms, like the protocols in robots.txt files.
- Be integrated into a digital rights management system that offers cyber protection and an automatically detectable declaration.
- However, it can’t consist only of technical protection measures without a declaration. While such measures don’t by themselves make TDM unlawful, circumventing them is forbidden, so they can block extractions that conflict with the chosen technical measure.
Another challenge is the retention of copies post-data mining. Reproductions can be held only as long as necessary for TDM. Hence, they can’t be kept for tasks beyond TDM, like validating results. Some believe copies for data mining can be retained for AI training, but it depends on whether AI training falls under TDM or a subsequent activity. If it’s the former, then copies might be kept during AI training.
The Directive doesn’t address the use of data post-computational analysis. Some experts suggest that leveraging data mining results might need the copyright owner’s permission. If only segments of content are mined, it’s essential to see if these segments are individually creative and protected. Some argue that using creative pieces doesn’t breach copyrights if the author’s intended meaning becomes unrecognizable in the new setting.
In summary, developers looking to train AI systems using copyrighted data should:
- Secure legal access to the data.
- Ensure rights holders haven’t excluded reproductions for TDM.
- Retain copies only for the TDM duration.
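The second check, that rights holders have not reserved TDM rights in machine-readable form, can be partly automated. Python's standard `urllib.robotparser`, for instance, can evaluate a site's robots.txt against a given crawler's user agent (the robots.txt content below is illustrative, not taken from any real site):

```python
from urllib import robotparser

# Illustrative robots.txt reserving the whole site against GPTBot
# while leaving other crawlers unaffected.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# The machine-readable reservation blocks GPTBot only.
print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

Of course, robots.txt is only one way a reservation can be expressed; contractual clauses or other machine-readable declarations still require legal review.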
The relevance of the TDM exception under the newly established regime of the EU AI Act
Under the current version of the EU AI Act, disclosure of the IP-protected material used to train artificial intelligence systems is likely to be required. This obligation risks leading to major disputes, unless the disclosing party is able to defend the legality of its conduct, relying for instance on the TDM exception discussed above.
Assessments of this type are included in the compliance evaluations covered by DLA Piper’s PRISCA AI Compliance, a legal tech tool able to perform a maturity assessment of artificial intelligence solutions against the major regulatory obligations. You can read more on the topic HERE. There is no doubt that web crawling by artificial intelligence systems might lead to challenges, and companies exploiting AI should therefore have a valid defense ready.
You may also find the following article interesting: “€20 million privacy fine against Clearview AI facial recognition system in Italy”.
Authors: Giulio Coraggio and Marco Guarna