The Italian Privacy Authority, the Garante, released an information note with detailed guidelines on how public and private entities can defend the personal data they publish online from web scraping carried out to train artificial intelligence (AI) systems.
The guidelines are advisory rather than mandatory, but they can serve as a useful benchmark for data controllers wishing to better protect personal information published online. The document reflects the contributions the Authority received during a fact-finding inquiry launched last December and offers preliminary guidance while the Authority prepares its decisions in several ongoing investigations into AI systems.
Definition of Web Scraping and Identification of the Phenomenon
The Data Protection Authority defines web scraping as the massive and indiscriminate collection of data, including personal data, through web crawling techniques. The practice involves not only collection but also the storage and retention of the data gathered by bots for subsequent uses, such as training generative artificial intelligence systems. The document provides a detailed analysis of the phenomenon, noting that a significant share of Internet traffic is generated by bots and that the data collected are often used to train AI models.
Measures Suggested by the Garante against Web Scraping by AI
To counter this phenomenon, the Authority has suggested several measures:
- Creation of Restricted Areas: Limiting data access to registered users reduces the public availability of data and hence the risk of scraping, consistent with the GDPR’s data minimization principle, and avoids unnecessary duplication of data.
- Clauses in Terms of Service: Including an explicit ban on scraping techniques in the Terms of Service can serve as a legal deterrent, allowing for legal action in case of breaches.
- Network Traffic Monitoring: Analyzing HTTP requests to identify anomalous data flows and applying countermeasures such as rate limiting can block unauthorized bulk access (a minimal rate-limiting sketch appears after this list).
- Anti-Bot Measures: The use of CAPTCHAs and periodic changes to a page’s HTML markup can hinder bot activity, as can embedding data in multimedia objects such as images, which forces scrapers to resort to costlier extraction techniques like optical character recognition (a server-side CAPTCHA verification sketch also appears after this list).
- Use of the robots.txt File: Although compliance is voluntary on the part of the bots, this file can instruct crawlers not to index or collect certain data (an example robots.txt appears below).
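To make the rate-limiting countermeasure concrete, below is a minimal Python sketch of a per-client token-bucket limiter. The capacity and refill values, and the choice to key buckets on client IP, are illustrative assumptions a site operator would tune, not parameters from the Garante’s note.

```python
import time
from collections import defaultdict

# Hypothetical tuning values: max burst size and steady-state rate.
CAPACITY = 20        # maximum burst of requests a client may make
REFILL_RATE = 1.0    # tokens (requests) regained per second

class TokenBucket:
    def __init__(self):
        self.tokens = CAPACITY
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at CAPACITY.
        self.tokens = min(CAPACITY, self.tokens + (now - self.last) * REFILL_RATE)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit

# One bucket per client identifier (here, assumed to be the client IP).
buckets = defaultdict(TokenBucket)

def handle_request(client_ip: str) -> int:
    """Return an HTTP status: 200 if allowed, 429 if rate-limited."""
    return 200 if buckets[client_ip].allow() else 429
```

In practice this logic usually lives in a reverse proxy or web application firewall rather than in application code, but the principle is the same.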
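As an illustration of how CAPTCHA checks are enforced server-side, here is a minimal sketch against Google reCAPTCHA’s published siteverify endpoint. The function name and the way the secret key and client token are obtained are assumptions for the example; the Garante’s note does not endorse any particular CAPTCHA provider.

```python
import requests  # third-party HTTP library (pip install requests)

# Google reCAPTCHA's documented server-side verification endpoint.
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"

def verify_captcha(secret_key: str, client_token: str) -> bool:
    """Check a reCAPTCHA token that the browser widget submitted with a request.

    secret_key is the site's private reCAPTCHA key; client_token is the
    value produced by the widget (both assumed to be available already).
    """
    resp = requests.post(
        VERIFY_URL,
        data={"secret": secret_key, "response": client_token},
        timeout=5,
    )
    return resp.json().get("success", False)
```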
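For the robots.txt measure, a controller can name the specific crawlers it wants to keep out. As a sketch, the entries below use the publicly documented user-agent tokens of three well-known AI-related crawlers (OpenAI’s GPTBot, Google’s Google-Extended training control, and Common Crawl’s CCBot); the choice of which bots to list is the operator’s, and, as the Garante observes, effectiveness rests entirely on the bot honoring the protocol.

```
# robots.txt — ask known AI-training crawlers not to collect site content.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```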
However, none of these measures can guarantee complete protection against web scraping. They should therefore be regarded as precautionary tools that data controllers evaluate and adopt, under the principle of accountability, to prevent unauthorized use of personal data by third parties.
Other European Approaches and What to Expect Next
This is not the first time a data protection authority has taken a position on web scraping. On May 1, 2024, the Dutch Data Protection Authority issued similar guidelines, clarifying that scraping covers not only the automated collection of information from web pages but also the harvesting of customer queries and complaints and the monitoring of online messages for reputation management. The Dutch authority stresses that the practice must conform to the GDPR, with an adequate legal basis for processing each category of personal data being scraped.
Generative artificial intelligence offers enormous benefits, but training these systems requires a huge amount of data, often collected through web scraping. It is crucial for website managers to adopt appropriate measures to protect personal data, balancing the need for innovation with the protection of individual privacy.
Although measures such as CAPTCHAs are recommended to defend personal data on online platforms, they are not always effective: modern AI-driven bots can now solve many CAPTCHA challenges, underscoring the need for more sophisticated, multilayered security strategies.
Faced with these challenges, companies should not rely solely on standardized solutions like CAPTCHAs but explore more advanced, tailored approaches to data protection. These can combine behavioral navigation analysis, multi-factor authentication, and continuous monitoring of suspicious activity (a toy behavioral heuristic is sketched below) to create an environment more resilient to sophisticated attacks.
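To illustrate one form such behavioral analysis could take (this example is not from the Garante’s note), the sketch below flags sessions whose inter-request timing is suspiciously regular, a simple heuristic for telling scripted clients from human browsing; both threshold values are hypothetical.

```python
import statistics

# Hypothetical heuristic: scripted scrapers often issue requests at
# near-constant intervals, while human browsing timing is irregular.
MIN_REQUESTS = 10         # samples needed before judging (assumed value)
MAX_STDEV_SECONDS = 0.05  # timing variability below this looks automated

def looks_automated(request_timestamps: list[float]) -> bool:
    """Return True if a session's request timing is suspiciously regular."""
    if len(request_timestamps) < MIN_REQUESTS:
        return False  # not enough evidence either way
    gaps = [b - a for a, b in zip(request_timestamps, request_timestamps[1:])]
    return statistics.stdev(gaps) < MAX_STDEV_SECONDS

# Example: 20 requests exactly 0.5 s apart would be flagged.
print(looks_automated([i * 0.5 for i in range(20)]))  # True
```

On its own a signal like this is weak; in a layered defense it would be combined with other indicators (user-agent anomalies, access patterns, failed CAPTCHA rates) before any blocking decision.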
On a similar topic, you can read the article “The EDPB Publishes ChatGPT Taskforce Report Revealing Major Challenges for GenAI Privacy Compliance”.