Hister: A free & self-hosted personal search engine

Forgive me if I am wrong and that is what you were trying to say @asciimoo, but I don’t believe that he has said DuckDuckGo is doing these things at all. All I have seen him say is that using DuckDuckGo requires trust, same as Google or most other services, and he wants to eliminate that element of trust. I think that is a commendable goal.

Putting your faith in a service simply because it is included in another respected privacy tool is ridiculous. Any faith you have in a service should rest on its own merits imo.

Cool project OP. I’m excited to see where this goes.

4 Likes

This isn’t proof that they are not storing or selling your data. It only means that they don’t use it for serving personalized ads.

The only way to verify if they are truly privacy respecting is inspecting how their internal system works. I can’t do it, so I have no tools to label them as a privacy protecting service.

No, the lack of personalized advertisements is not proof of being privacy respecting. It only means they don’t sell data this way, nothing else.

It’s unfortunate that not believing a profit oriented corporate entity’s words without having the ability to verify their statements makes you think that my knowledge about privacy is limited.

2 Likes

Even if they did, they would have far less data than Google has. It’s hurting my brain that you don’t understand this. You lost my trust here.

2 Likes

This is ridiculous. I’m starting to think that this is intentional rage bait. =)

Why does having less data mean that a service is superior in privacy?

There are much, much worse things that personal/search data can be used for than displaying personalized advertisements.

1 Like

Nah, you are spreading FUD about a privacy-friendly search engine. You don’t seem to understand that they don’t sell data in the first place, and even if they did, they would not have as bad an impact on privacy as Google, to which you default.

1 Like

You’re welcome, and thank you for your response.

I recommend implementing stronger security measures. For example, if I use my server on my personal Android device, and my phone is stolen, even if the thieves bypass the main security layer, they won’t be able to access or extract the personal work I’ve done with the service.

I encourage you to continue with the project.

Regarding RAM in safe mode, I mean ensuring that optimization doesn’t break anything—whether it’s automatic or manual—for example, by handling words on a page, unexpected errors, etc.

The documentation mentions memory, but I don’t see anything about safe mode. Perhaps you could combine the current options into a single, separate function.

This gave me another idea: instead of just RAM, expand the functionality to include more options in a single, one-click or automatic function.

Thanks for the explanation. I fully agree on prioritizing stronger security measures. I’ve added your suggestions to my TODO list.

4 Likes

I agree, this is what I understood from Hister and to my understanding it is true. Search Engines are inherently problematic for privacy because you need to send the plaintext query/content to a third-party for it to be processed and return results. You can mitigate some of the metadata and fingerprinting problems (non-captcha’d Tor/VPN usage, JavaScript-free support, proxying etc.) but you cannot get around the original problem.

4 Likes

New Hister release: v0.14.0 with Mastodon support. Update your instances.

4 Likes

I think this is an interesting idea but is there a way to not index everything? It seems tedious to have to go and delete multiple pages that may have been clicked but found to be AI slop or otherwise.

Basically, could there be an “Add to Hister” button to build the index manually and make it more relevant?

You can disable automatic indexing and use only the “(Re)index page” button on the extension to add sites manually. Also, you can create domain/URL based rules to exclude content from your index.
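To illustrate what domain/URL based exclusion rules can look like in general, here is a minimal Python sketch. It is not Hister’s actual rule syntax or implementation: the rule lists, the wildcard format, and the `should_index` helper are all hypothetical and only show the idea of filtering pages before they reach the index.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical rule lists; these are NOT Hister's real configuration format.
EXCLUDED_DOMAINS = ["ai-slop.example", "*.tracker.example"]
EXCLUDED_URL_PATTERNS = ["*://*/login*", "*://mail.example.com/*"]


def should_index(url: str) -> bool:
    """Return False when the URL matches any exclusion rule."""
    host = (urlsplit(url).hostname or "").lower()
    # Domain rules: compare the host against shell-style wildcard patterns.
    if any(fnmatchcase(host, pattern) for pattern in EXCLUDED_DOMAINS):
        return False
    # URL rules: compare the lowercased full URL against each pattern.
    if any(fnmatchcase(url.lower(), pattern) for pattern in EXCLUDED_URL_PATTERNS):
        return False
    return True


if __name__ == "__main__":
    for candidate in (
        "https://example.org/article",
        "https://ai-slop.example/top-10",
        "https://example.org/login?next=/",
    ):
        print(candidate, should_index(candidate))
```

With rules like these, a page is checked once at indexing time, so excluded sites never need to be deleted from the index afterwards.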

I did see this, but as a user I would rather see a term other than “re-index”: if I disable automatic indexing, the page was never indexed in the first place.

Additionally, while I can exclude URLs, I do that at the search engine level (Kagi) so I wouldn’t do it here and it would be tedious to then have to go and remember that “I-Love-AI-Slop.org” is not a valid page.

I’d rather create my own custom repository for searching versus the vacuum approach if that makes sense.

Very interesting, and I like the terminal interface. Maybe I’m not the intended user though, because I have a hard time seeing how this would be useful to me. Almost always when I’m searching the web, it’s because I’m looking for new info/websites. I don’t often search the web to look up the same content again; if it’s something I need continual access to, I just bookmark it.

1 Like

Yeah, that’s true, it should be called “index” in this case.

Additionally, while I can exclude URLs, I do that at the search engine level (Kagi) so I wouldn’t do it here and it would be tedious to then have to go and remember that “I-Love-AI-Slop.org” is not a valid page.

I’m not sure if I understand this. Could you explain? Domain/URL based rules are search engine level rules.

I just found out about this project. I haven’t tried it yet, so sorry if I’m saying anything stupid, but as I was reading about it, one thing came to mind: would it make sense to build some sort of opt-in federation/publication of indices to crowdsource a decentralised index?

2 Likes

It would absolutely make sense, and it is in my future plans, alongside creating smaller downloadable pre-indexed datasets to allow users to quickly extend their local index/database.

I’d appreciate any kind of help designing/implementing these features even by describing use-cases or requirements.

2 Likes

That’s great to hear. I’m a software developer, so I might be able to help out, although I don’t have that much time available…

I do think it’s a hard problem to solve. Off the top of my head, I can see several challenges: index size, the risk of people sharing indexed private or personal data without realising, bad actors sharing “poisoned” data linking to fraudulent sites, and people submitting indexed illegal content such as CSAM. Then, if the index is hosted by a third party, there’s the issue of how to prevent the index server from tracking people’s search queries.

For the index size, I guess it could be broken down as you mention, maybe by language and category.

For preventing the index from containing personal data, the indexer should probably visit the sites without any cookie or auth header. Even then, there’s the risk of indexing pages that are publicly accessible but have hard-to-guess URLs and aren’t meant to be indexed or found by crawlers, like Google Docs etc. Maybe reading the robots.txt would be enough to avoid indexing pages that aren’t meant to be indexed?

For the “poisoned” or illegal content… that’s probably the hardest issue, and as long as random people can submit content, the problem will always be there. Maybe some sort of key-pair signing could help, where the nodes hosting the indices choose which public keys’ contributions to accept and moderate the indices they host. As I’m typing this, it kind of rhymes with NOSTR. Maybe that could be the decentralised index distribution system.

For anonymous queries, people can rely on Tor either to download the index from the nodes/relays and then run offline queries, or to query the remotely hosted indices directly.
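As a rough illustration of the robots.txt idea, here is a minimal Python sketch using the standard library’s `urllib.robotparser`. The user agent name, the sample robots.txt, and the `allowed_by_robots` helper are made up for the example, and it assumes the robots.txt body was already downloaded without cookies or auth headers. It doesn’t solve the hard-to-guess URL problem, since robots.txt is advisory and such pages are often not listed there at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent for a shared indexer; not a name used by Hister.
USER_AGENT = "example-indexer"

# Made-up robots.txt body, standing in for one fetched from a site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a URL against an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for candidate in (
        "https://example.org/blog/post",
        "https://example.org/private/notes",
    ):
        print(candidate, allowed_by_robots(SAMPLE_ROBOTS, candidate))
```

A shared indexer could run a check like this before accepting a page into a public index, while a purely local index could stay more permissive.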

1 Like

You’ve already helped a lot. Thanks! <3

Exactly! These are all valid concerns without trivial solutions.

I think it would be better to continue working on the specifications in Hister’s GitHub or Codeberg repo, so that other developers can join the brainstorming. I’ll create an issue/discussion for both of these improvement suggestions.

Feel free to join us on the development platforms and on IRC/Discord to continue this discussion.

2 Likes

@asciimoo

Question: Are you thinking of creating an Android app (APK)?
So you don’t have to rely on external apps.

I have not thought about it yet, but I’m definitely not against it. Do you have any specific ideas about what features the app should have?