The Internet Archive: The Double-Edged Sword of Information Freedom and Privacy

The Internet Archive is, let’s be real, a very good website. It offers free access to vast amounts of media, books, software, and cultural artifacts. It also preserves rare and lost content that might otherwise disappear forever. For researchers, historians, and everyday users, it’s an invaluable resource.

But the Internet Archive isn’t just about media preservation. It also operates the Wayback Machine, a powerful tool that automatically crawls websites, capturing and storing snapshots at specific points in time. This is where things get complicated.

If a website ever exposed your personal data publicly, whether because you gave the website (not the Wayback Machine) consent, or through negligence, outdated practices, or simple oversight, the Wayback Machine may have archived it. And once it’s in, removing it is another story. The Internet Archive’s support team is notoriously slow and unresponsive, and when removals do happen, they rarely involve actual deletion. Instead, the data is merely “excluded” from public access while remaining stored indefinitely in their private archives.

This practice raises serious legal and ethical questions, particularly under the General Data Protection Regulation (GDPR) and similar laws. By retaining personal data indefinitely without proper erasure, the Internet Archive may be in violation of GDPR Article 5(1)(b), (c), and (e). The organization claims an exemption under the “archiving purposes in the public interest” clause. However, GDPR Article 17(3)(d) makes clear that the exemption applies only where erasure would “render impossible or seriously impair” the purpose of the processing, and erasing one individual’s personal data would not do that.

Even more concerning is the lack of transparency. There is no practical way for a data subject to know that their information has been processed, what specific data was processed, from what web page, or how long it will be stored. This makes exercising the right to erasure nearly impossible, since most people cannot request removal of data they don’t even know exists in the archive (if it isn’t kinda obvious).

Additionally, the Internet Archive has announced that it will no longer honor robots.txt files when crawling websites. They justify this by saying that robots.txt was designed for search-engine bots, not archival goals, and that ignoring it helps produce more “complete snapshots” of web pages. What this means in practice is that even sites that explicitly disallow crawling can still get archived by the Wayback Machine. While ignoring robots.txt isn’t explicitly illegal, it suggests that the Internet Archive is not operating the Wayback Machine in good faith.
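For anyone unsure what a robots.txt rule actually expresses, here is a minimal sketch using Python’s standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `ia_archiver` user-agent token are all assumptions made for illustration. The file is only a request: the parser shows what a compliant crawler would conclude, and nothing in it stops a crawler that chooses not to check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site. The "ia_archiver" token is an
# assumption about how the archival crawler identifies itself.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeOtherBot"):
        for path in ("/", "/private/notes.html"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

The rules are evaluated by the crawler, not enforced by the site, which is why a crawler that ignores the file can archive a page anyway.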

Finally, the exemption under the “archiving purposes in the public interest” clause only applies at all “if the processing is subject to appropriate safeguards for individuals’ rights and freedoms” and “if the processing is not likely to cause substantial damage or substantial distress to an individual”. Given the lack of transparency, the blatant refusal to delete personal data, and the potential for long-term harm to individuals, including children, whose sensitive data is permanently preserved even if they have explicitly objected, it’s hard to argue that these safeguards have been implemented or applied appropriately.

In conclusion, the Internet Archive remains an important tool for preserving digital history. Yet, without stronger accountability and a fair process for handling personal data, it risks standing on the wrong side of privacy laws and the rights of billions of people.

See more: All potential privacy law violations by the IA in depth
Disclaimer: The above information might be outdated or incorrect. Reader discretion is advised; please do your own research.

If you have any questions, feel free to ask.

I agree that there are moral issues here, but overall I doubt there’s any legal route to attacking the Internet Archive for this.

GDPR is a European law, and IA is a US non-profit that probably doesn’t have any obligation to comply, although it does have to respect California privacy law.

This is a generalised oversimplification, but privacy becomes a legally dubious argument if it was something in public view.
It’s perfectly legal to photograph and record people out in public on the street.

Only if they can’t be identified. If you’re taking a photo of someone who is widely recognizable or has explicitly said they don’t want to be photographed, then you would need to blur their face. The same goes for personal information online.

Near the end of 2024, the Internet Archive was hit by a DDoS attack, which could have brought more awareness to this issue. Unfortunately, it was carried out by extremists who did it because “the IA is American”.

In August 2025, Reddit spokesperson Tim Rathschmidt said that the Internet Archive has been unable to “defend their site and comply with platform policies. This includes respecting user privacy by deleting removed content”. Reddit has now officially blocked the Wayback Machine from scraping user-generated content on Reddit, leaving only the home page, where the Wayback Machine can still index news and popular posts. If more social media websites do this in the future, their users’ right to privacy would no longer be infringed by the Wayback Machine, while it could continue to preserve the history of those websites.

It might also be because Reddit has an AI partnership with Google.


GDPR doesn’t really apply to publicly available data anyway. And the Right to be Forgotten is more about indexing data than making it available.

I don’t think they index random social media accounts (except those of public notoriety), for example.

Also, IA absolutely has to comply with EU law. Their jurisdiction doesn’t matter here.

GDPR does in fact apply to publicly available information though, as its definition of personal data is “any information relating to an identified or identifiable person”. Making personal data available likely goes beyond just indexing, and there are likely other laws that cover that. Still, to make the data available you have to index and store it, so Article 17 would apply in most cases.

If a web page has enough traffic, they will archive it. Heck, check your privacyguides account: they might have already archived that, and if they haven’t, they will one day. The point I’m trying to make is that the Internet Archive archives random people’s stuff that might contain personal information, instead of just the things that have actual historical value.
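If you want to check a page yourself, the Wayback Machine exposes a public availability endpoint at `https://archive.org/wayback/available`. Below is a rough sketch of querying it from Python. The helper names (`build_query`, `closest_snapshot`, `check`) and the `example.com/user/alice` URL are made up for illustration, and the response shape (`archived_snapshots` containing a `closest` entry) is what I understand the endpoint returns, so treat the parsing as an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url: str) -> str:
    """Build the availability query for one page URL."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_snapshot(payload: dict):
    """Return (snapshot_url, timestamp) from a decoded response, or None.

    Assumes the response looks like:
    {"archived_snapshots": {"closest": {"available": true, "url": ..., "timestamp": ...}}}
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url"), closest.get("timestamp")
    return None


def check(page_url: str, timeout: float = 10.0):
    """Query the endpoint over the network and report the closest snapshot."""
    with urlopen(build_query(page_url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))


if __name__ == "__main__":
    # Hypothetical profile URL; replace it with a page you care about.
    print(check("example.com/user/alice"))
```

A `None` result only means no public snapshot was reported for that URL. Since removals are handled by exclusion rather than deletion, it does not tell you whether a copy is still stored.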

Very useful article on how to block the IA from your websites and protect the privacy of your users: “How to Block Wayback Machine from Archiving Your Website” (Reputationo)

Sorry but I doubt that. At least when it’s for public utility purposes, journalists can absolutely share deleted tweets from someone, for example. I might be wrong, but it would seem extreme for anyone to be able to wipe off the web things they publicly said for public viewing (definition of social media). If you have a court case saying otherwise, I am open to it.

There is a huge difference between journalism and archiving though (at least individually vs. as an organization). Journalists who reference or screenshot deleted material are usually acting under their individual freedom of expression. They do not systematically “process” or “retain” the full dataset, but instead use the information in a specific context (e.g. commentary, reporting, criticism). They aren’t a “data controller” in the sense of the GDPR: they don’t hold your data, they just republish it (there might be other laws and ethical issues related to publishing deleted things shared by someone, but it’s not against the GDPR). The Internet Archive, by contrast, is an organization running a structured, automated system that crawls, stores, indexes, and indefinitely retains personal data.

Deleted Tweets are not PII though. The content you publish online is not covered by the GDPR.

Unless they contain PII, in which case no, a journalist can’t really share a deleted tweet that says like “hey my social security number is 101-12-3456”

Totally. That was my point.

So the Web Archive will mostly capture info that you willingly publicized on the Internet.

At least for high-profile cases, they typically remove archives for doxxing websites, such as the “charlie murderers” website.

Under the GDPR, the definition of PII is “any piece of information that relates to an identifiable person/individual”. That means deleted tweets are not PII only if the post cannot be linked to the poster; otherwise they would be PII according to the GDPR. This is backed up by the Nowak v Data Protection Commissioner case, which ruled that a candidate’s written exam answers, which do not look like PII on their own, are still personal data because they can be linked to the exam candidate. Content published online would also be covered by the GDPR, given that it falls under the same definition.

Interestingly, the court case you linked predates the GDPR.

It’s true you have a right of deletion for your content, but I wonder how far it goes. The main question is whether you can force others to respect it too. Would the Wayback Machine be considered journalism?

Making someone respect your right of deletion is difficult, which is why we have laws like the GDPR. In reality it would still be a problem, though: most smaller organizations, or ones that don’t operate in the EU, will still ignore valid requests, because those requests are very hard to enforce and the organizations likely won’t be held accountable.

Guide on how to remove something you own, or something that contains your personal information, from the Internet Archive: “Reddit - The heart of the internet”. You should only bother doing so if it’s very sensitive information that should not remain public, as they will only “exclude” it from the Wayback Machine lookup instead of permanently deleting it.