Startpage has apparently started to fingerprint users

Hi from Startpage:

One of our users alerted us to this thread on Friday through our feedback form and we sent over some notes to Jonah & Niek. Reposting them with additional detail.

The most frequent complaint we receive from users, especially privacy-minded ones, is that we block or captcha too many real users, and that the #1 thing we could do to improve their experience is to fix this.

Every day we see millions of bots attempting to crawl our site (even more now that people are trying to train AI models), and we are often subject to DDoS attacks as well. This results in massive expense and risk. We have always had some bot detection in place, but because we don’t track IPs or drop cookies, we have no idea whether someone is a new or returning user, so real and fraudulent users can look very similar (e.g. those using a free VPN).

In response to this user input, we have begun to implement more sophisticated methods for bot detection that still honor our privacy policy (which is a very hard problem to solve). Historically we have had only a few signals for identifying botlike activity, such as country or user agent. Now we are exploring the use of client-side data to improve the precision of these determinations. We compare client information against known bot patterns in real time to determine whether the current search is being executed by a bot.
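To make the idea of matching client information against known bot patterns more concrete, here is a minimal Python sketch. It is an illustration only, not Startpage's actual implementation: the signal names (`user_agent`, `webdriver`, `language_count`), the patterns, the weights, and the threshold are all hypothetical. The decision uses only the current request's signals, and nothing is written to storage.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientSignals:
    """Hypothetical per-request signals; no IP, cookie, or query included."""
    user_agent: str
    webdriver: bool      # e.g. the browser's navigator.webdriver flag
    language_count: int  # e.g. number of entries in navigator.languages


# Hypothetical known-bot patterns: (description, predicate, weight).
KNOWN_BOT_PATTERNS = [
    (
        "automation-style user agent",
        lambda s: re.search(r"headless|python-requests|curl", s.user_agent, re.I)
        is not None,
        0.6,
    ),
    ("automation flag set", lambda s: s.webdriver, 0.5),
    ("no languages reported", lambda s: s.language_count == 0, 0.3),
]

BOT_THRESHOLD = 0.7  # assumed cut-off, chosen only for this example


def bot_score(signals: ClientSignals) -> float:
    """Sum the weights of all matching patterns, capped at 1.0."""
    total = sum(weight for _, matches, weight in KNOWN_BOT_PATTERNS if matches(signals))
    return min(1.0, total)


def is_probable_bot(signals: ClientSignals) -> bool:
    """Decide from the current request alone; nothing is persisted."""
    return bot_score(signals) >= BOT_THRESHOLD
```

In this sketch a single weak match (for example, an unusual user agent on its own) stays below the threshold, so one odd signal does not block a real user; several matching patterns together do.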

Some things we aren’t doing:

  • saving or sharing PII including IP address
  • storing the search query
  • associating client information with PII or session information
  • saving client information to be used for any purpose other than bot detection
  • loading any third-party assets

As we explore these detection tools, we’re trying to find the right balance of signals to perform an effective analysis without over-collecting. For example, we released a handful of signals on the 16th but rolled some back on the 23rd after determining they were unhelpful in the context where they had been deployed. Obviously, from a privacy perspective we would prefer not to need any client signals at all. On the other hand, we have received thousands of notes from users arguing that constantly having to solve captchas or contact us to be unblocked also exposes them to additional scrutiny and undermines their privacy.

Note that we have an extremely small team and may not monitor this forum regularly, but if you have follow-ups or ideas, feel free to reach out to our Support team.
