The Internet Archive is, let’s be real, a very good website. It offers free access to vast amounts of media, books, software, and cultural artifacts. It also preserves rare and lost content that might otherwise disappear forever. For researchers, historians, and everyday users, it’s an invaluable resource.
But the Internet Archive isn’t just about media preservation; it also operates the Wayback Machine, a powerful tool that automatically crawls websites, capturing and storing snapshots of pages at specific points in time. This is where things get complicated.
If a website ever exposed your personal data publicly, whether through consent you gave the site (the website, not the Wayback Machine), negligence, outdated practices, or simple oversight, the Wayback Machine may have archived it. And once it’s in, removing it is another story. The Internet Archive’s support team is notoriously slow and unresponsive, and when removals do happen, they rarely involve actual deletion. Instead, the data is merely “excluded” from public access while remaining stored indefinitely in the Archive’s private collections.
This practice raises serious legal and ethical questions, particularly under the General Data Protection Regulation (GDPR) and similar laws. By retaining personal data indefinitely without proper erasure, the Internet Archive may be in violation of GDPR Article 5(1)(b), (c), and (e). The organization claims an exemption under the “archiving purposes in the public interest” clause. However, GDPR Article 17(3)(d) makes clear that exemptions apply only where erasure would “render impossible or seriously impair” the purpose of processing, a bar that erasing specific personal data would not meet.
Even more concerning is the lack of transparency. There is no practical way for a data subject to know that their information has been processed, what specific data was processed, from what web page, or how long it will be stored. This makes exercising the right to erasure nearly impossible: most people cannot request removal of data they don’t even know exists in the archive (unless its presence happens to be obvious).
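To illustrate how opaque this is in practice, the only self-service option a data subject has is to query the Wayback Machine’s public CDX API for a URL they already suspect was archived; there is no way to search by name or by the data itself. A minimal sketch (the endpoint is the Archive’s documented CDX server; `example.com` is a placeholder):

```python
# Sketch: listing Wayback Machine snapshots of a known URL via the
# public CDX API. This only works if you already know which page
# exposed your data -- there is no search by person or by content.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(page_url: str, limit: int = 10) -> str:
    """Build a CDX query returning JSON rows of archived snapshots."""
    params = urlencode({
        "url": page_url,
        "output": "json",  # first row of the response is a column header
        "limit": limit,
    })
    return f"{CDX_ENDPOINT}?{params}"

def list_snapshots(page_url: str, limit: int = 10) -> list[list[str]]:
    """Fetch snapshot records (timestamp, original URL, status, ...)."""
    with urlopen(cdx_query_url(page_url, limit)) as resp:
        rows = json.load(resp)
    return rows[1:] if rows else []  # drop the header row
```

Each returned row includes a `YYYYMMDDhhmmss` capture timestamp and the original URL, which is exactly the information you would need to file an exclusion request; the catch, again, is that you must already know where to look.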
Additionally, the Internet Archive has announced that it will no longer honor robots.txt files when crawling websites. It justifies this by arguing that robots.txt was designed for search-engine bots, not archival purposes, and that ignoring it helps produce more “complete snapshots” of web pages. In practice, this means that even sites that explicitly disallow crawling can still be archived by the Wayback Machine. While ignoring robots.txt isn’t explicitly illegal, it suggests that the Internet Archive is not operating the Wayback Machine in good faith.
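For context, here is the kind of robots.txt exclusion that site owners historically relied on, and that the policy change renders ineffective for archiving. (The `ia_archiver` user agent is the one the Archive traditionally honored for exclusions; the paths are placeholders.)

```
# Address the Internet Archive's crawler directly -- under the new
# policy, directives like these no longer prevent archiving.
User-agent: ia_archiver
Disallow: /

# A blanket exclusion for all well-behaved crawlers.
User-agent: *
Disallow: /private/
```

The point is that robots.txt was always a voluntary convention, not an access control: it only works as long as the crawler chooses to respect it.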
Finally, the “archiving purposes in the public interest” exemption applies at all only “if the processing is subject to appropriate safeguards for individuals’ rights and freedoms” and “if the processing is not likely to cause substantial damage or substantial distress to an individual”. Given the lack of transparency, the blatant refusal to delete personal data, and the potential for long-term harm to individuals, including children, whose sensitive data is permanently preserved even when they have explicitly objected, it’s hard to argue that these safeguards have been implemented or applied appropriately.
In conclusion, the Internet Archive remains an important tool for preserving digital history. Yet, without stronger accountability and a fair process for handling personal data, it risks standing on the wrong side of privacy laws and the rights of billions of people.
See more: All potential privacy law violations by the IA in depth
Disclaimer: The above information might be outdated or incorrect; reader discretion is advised. Please do your own research.