Hister: A free & self-hosted personal search engine

I’m working on a self-hosted search service called Hister, with the goal of reducing dependence on online search engines. It can provide significantly more privacy than metasearch engines like Searx.

Hister is a full-text indexer for websites and local files which automatically saves all the visited pages rendered by your browser. It has a flexible web (and terminal) search interface, offline result previews, and a query language that lets you explore saved content with ease or quickly fall back to traditional search engines.

I’ve been using it for a few months, and as my local index grows I can avoid opening Google/DuckDuckGo/Kagi more and more often.

The initial reception has been overwhelmingly positive, with more than 25 contributors and hundreds of contributions already. Perhaps you will find it useful as well. (Or at least have some constructive criticism =]) I hope more users and contributors join us to make Hister better and better.

A few useful links:

About me: I have been developing privacy-protecting and data-liberating free software since 2008. I’m the author of Searx, Colly (gocolly/colly on GitHub, a scraper and crawler framework for Go) and many smaller free software/self-hosted projects (asciimoo on GitHub).

4 Likes

Excuse us for the delay in publishing this (5 days). We are getting more and more posts to review because many AI slop and unready vibe-coded projects (read: nice-looking frontend with no functionality) are being submitted, so it is starting to take us a bit longer to review submissions, as we don’t want to overload the community with such posts.

4 Likes

This looks very interesting. My first question immediately was why not direct traffic to DuckDuckGo instead of Google?

I second the previous poster. It does indeed look very nice.

Personally, my biggest problem with Hister is that it would require me to install an additional browser extension, which would make my browser even more fingerprintable. Or am I missing something?

No worries, it’s better to wait a bit than to have low-quality content. Thanks for the thorough review and for keeping this community healthy.

My first question immediately was why not direct traffic to DuckDuckGo instead of Google?

We have no way to prove that DDG does better in terms of privacy. But I can prove that they provide lower quality results. The external search engine is configurable, so anybody can set their preferred one (even SearXNG).

which would make my browser even more fingerprintable

Your browser is probably already incredibly “fingerprintable” if you have JS enabled, especially if you use Firefox and have any extension installed. (See this post for details). But even without extensions, it’s very easy to track someone who has JS enabled. On the other hand, if you don’t have JS enabled, you cannot be tracked by extension IDs. So in either case there is no need to worry about adding more fingerprintability by installing an extension.

EDIT: Using the extension isn’t mandatory; it just makes indexing visited sites convenient. You can use Hister’s command line crawler and the API to populate your index.

The best protection against fingerprinting and tracking is not visiting a website. That’s my main approach with Hister. Sure, it takes time to grow your index and you cannot always avoid using the internet, but the more data you have locally, the lower your exposure to privacy threats.

Please explain yourself here. It seems pretty obvious that DDG has better privacy practices.

1 Like

It isn’t obvious to me at all. The only verifiable difference is that what they say sounds better from a privacy perspective.

If I cannot verify how my data is handled, then I consider that service a privacy risk no matter what they say. Even if everything DDG claims is true, their system can get compromised, and malicious actors can still steal/log/monitor your data while you have a false sense of using a “privacy respecting” service.

By this logic of evaluation, you’d have to discount 1Password. And discounting 1Password as a fantastic password manager, for all that it is and does, would not make one bit of sense.

By the same logic, you also couldn’t be sure your money is safe in your bank, and you would live every day in perpetual uncertainty about whether it is still there.

This logic makes no sense. Sorry.

2 Likes

Adding an extra extension still adds some fingerprinting entropy, so “no need to worry” is too strong (we could also argue about supply chain security, trusting another extension, etc.). Yes, it’s true that enabling JS increases the fingerprinting surface a lot, but that does not mean extra extensions stop mattering. Also, the “especially Firefox + any extension installed” part seems misleading, as afaik current Firefox uses random moz-extension UUIDs for web-accessible resources specifically to reduce ordinary website fingerprinting of installed extensions.

On the other hand, I agree that disabling JS helps, but it does not eliminate tracking, and extensions can still affect what servers or sites can see and collect.

Also I never said I was using Firefox.

Maybe I should rephrase my hastily written initial answer as: there is extra risk in using the extension, as there is with any extension.

I didn’t mean this to offend you or anything, I still think the project is nice :heart:

Isn’t trust an integral part of the internet and IT in general though? Don’t I trust you and multiple other parties when installing the extension? And I’m not speaking about just trusting your code; I mean trusting that the extension really is exactly the same code as in your repo, etc.

Also since you’re the author of Searx too: Why should I use this then instead of Google?

In the end the internet is about a lot of trust imo.

Edit: stupid me linked the wrong Firefox article, should be corrected now

Also, if you’re going to go about arguing that DDG is not a privacy-respecting search engine, I’m inclined to disparage your views in full, because it simply ain’t true.

Privacy is a spectrum. What’s private for one may not be private enough for another. This would also mean The Tor Project is wrong to use Duck as their default search engine because Duck is not private, and to disagree with them on privacy, at least to the extent of DDG, is laughable.

2 Likes

There you can verify whether your data leaves your machine encrypted. I don’t consider storing and retrieving encrypted data remotely a privacy risk.

This has nothing to do with privacy. Btw, banks are a pretty big privacy risk, but this is a totally different topic.

There is nothing wrong with having this opinion; however, judging by the examples you wrote, it seems you’ve misunderstood what I’m saying.

1 Like

How have I misunderstood the statement I quoted in my first comment? It reads pretty clearly. The logic of your statement and my examples are the same. I don’t see a difference.

Saying I am misguided but not explaining/clarifying yourself right after? I don’t see this as a fruitful way to communicate.

Yes, this is where the problem lies. Using random UUIDs makes extension fingerprinting impossible, but it makes browser fingerprinting fully possible even without cookies. It’s nice that websites can’t query which extensions you use, but with the fully unique extension ID they can identify your browser at any time. I think that is a much bigger privacy risk.

How can servers get information about your extensions without using JavaScript?

I did not consider what you wrote offensive. It’s always good to challenge ideas.

Unfortunately yes, it requires more trust than would be ideal. I’d still like to force myself to always remember this and minimize it as much as I can.

This is why free software and self-hosting are important. You don’t have to trust a provider; you can verify and build everything locally. Of course, you don’t have to verify everything and self-host; it is totally fine to consider sacrificing privacy for convenience. But at least if you’d like to min/max privacy, you have the option to do it this way.

You shouldn’t use it. Trusting it blindly can cause more privacy harm than using Google sceptically in some special cases. That’s one of the reasons why I stopped developing it a long time ago and why I’m developing Hister.

True, but I hope we can minimize the trust required with the right tools and policies.

1 Like

I like the project’s style, and it’s clear it has potential. I don’t see any content overload, but I appreciate its simple yet high-quality interface. Being free and customizable is even better.

One suggestion: Integrate Spanish.

Questions:

1. Do the extension and server have security measures? If so, what are they? If not, do you have plans to implement them?

2. Does the database have automatic or manual maintenance to update content in real time in case of new website updates?

3. Does it have secure RAM optimization where the server runs without breaking content or affecting performance?

That’s all for now.

1 Like

I really like what you’re doing @asciimoo . The premise (and website) looks incredible, interesting, and my only concern out of the gate is data storage size and local data/index encryption. I’ve saved it to investigate it later on for sure!

And don’t mind the negative comments. Keep up the good work!

2 Likes

Thanks for the feedback <3

The indexer is language-agnostic, and Spanish stemming is available. Interface localization, however, is entirely missing. It is planned for the future.

Token, password and OAuth/OIDC based authentication is available.

The extension periodically checks whether the opened websites have changed and updates the index, but the database itself does not have any maintenance feature yet. It is definitely a good idea tho. I’ll add it to my TODO list.
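To illustrate the idea of re-indexing only when a page has actually changed, here is a minimal sketch. It is not how the Hister extension does it (the thread does not say); it simply assumes a content hash per URL is compared between checks. `content_fingerprint` and `ChangeTracker` are hypothetical names made up for this example.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Hash the page text after collapsing whitespace, so purely
    cosmetic whitespace changes do not count as a content change."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ChangeTracker:
    """Hypothetical helper: remembers the last fingerprint seen per URL."""

    def __init__(self) -> None:
        self._seen = {}  # url -> fingerprint of the last indexed version

    def needs_reindex(self, url: str, text: str) -> bool:
        """Return True if the page is new or its text changed since the last check."""
        fingerprint = content_fingerprint(text)
        if self._seen.get(url) == fingerprint:
            return False
        self._seen[url] = fingerprint
        return True


tracker = ChangeTracker()
print(tracker.needs_reindex("https://example.org/", "hello world"))     # first visit
print(tracker.needs_reindex("https://example.org/", "hello   world\n")) # whitespace only
print(tracker.needs_reindex("https://example.org/", "hello there"))     # real change
```

A server-side maintenance job could reuse the same comparison when re-crawling stored URLs, which is roughly what the question about automatic updates was asking for.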

Could you explain what secure RAM optimization means?

1 Like

Currently the size requirements are around 100 MB per 1,000 websites, but this can be reduced by at least an order of magnitude, because I currently store the whole raw HTML of the websites in the index for development purposes. In the near future I’d like to exclude HTML/favicons from the index, which will not only significantly reduce the index size but also make the growth of the index non-linear.

Encryption, however, is an issue for which I have yet to find a better solution than the default full-disk encryption. Let me know if you have suggestions. Unfortunately, the indexer lib I use does not support index encryption.
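As a rough back-of-the-envelope reading of those numbers, the sketch below turns them into a per-page estimate. It assumes "Mb" means decimal megabytes and that size grows linearly with page count; the post suggests growth will stop being linear once raw HTML and favicons are dropped, so treat the second column as a rough guide only.

```python
MB = 1000 * 1000  # assumption: decimal megabytes


def index_size_bytes(pages: int, bytes_per_page: float) -> float:
    """Linear estimate: total size is proportional to the number of pages."""
    return pages * bytes_per_page


# ~100 MB per 1,000 pages works out to ~100 KB per page today.
current_per_page = 100 * MB / 1000
# "At least an order of magnitude" smaller means ~10 KB per page or less.
reduced_per_page = current_per_page / 10

for pages in (1_000, 10_000, 100_000):
    now = index_size_bytes(pages, current_per_page) / MB
    later = index_size_bytes(pages, reduced_per_page) / MB
    print(f"{pages:>7} pages: ~{now:,.0f} MB now, ~{later:,.0f} MB after the reduction")
```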

1 Like

It is quite simple. We know that Google uses your data and stores your searches to serve you personalized ads, and that it tracks you around the web using its other products. DDG evidently does not do this. I don’t understand how it is not obvious to you that, even without full proof, it is the better option.

I really think your project is interesting, but your take here makes me doubt that you understand the topic well enough to guard my data.

2 Likes

I stand corrected. I thought they would reset on browser restart, but it seems they persist (per profile).

The browser still transmits a load of metadata and other information. Not using JS also makes you stand out a lot. If you open https://amiunique.org/ in a Tor Browser without JS enabled, for example, you can still see quite a bit.

Automatic Page Indexing

The extension automatically captures page content every time you visit a URL. It extracts the page title, full text, HTML, and favicon, then sends them to your Hister server via its API.
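For readers wondering what such a captured document might look like, here is a small sketch that pulls the title and visible text out of a page and bundles them with the raw HTML and favicon. The field names (`url`, `title`, `text`, `html`, `favicon`) and the `build_document` helper are assumptions for illustration; they are not taken from Hister's actual API, and no request is sent anywhere.

```python
import json
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collects the <title> and visible text, skipping script/style bodies."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def build_document(url: str, html: str, favicon: str = "") -> dict:
    """Hypothetical payload builder; the key names are illustrative only."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title.strip(),
        "text": " ".join(parser.chunks),
        "html": html,
        "favicon": favicon,
    }


page = (
    "<html><head><title>Example</title><script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>World</p></body></html>"
)
print(json.dumps(build_document("https://example.org/", page), indent=2))
```

Note that everything in this dictionary, including the full page text, is what ends up stored on the server, which is the point the comments below about personal data are making.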

Hister does not encrypt the history data it stores. This is only a problem if you don’t trust the Hister server your clients are communicating with.

It is quite a thing to consider that this will also include all your personal data. So if you host this somewhere centrally, all your data will be there. I suppose that is the point, but it creates quite a target.

Personally, I wouldn’t use it unless it is end-to-end encrypted, so that only my own clients can decrypt the data living on the central server.

2 Likes