Large language model (ChatGPT/Copilot-like) search engine that respects privacy

Looking for the DuckDuckGo of LLMs: an online site where one can get a service like ChatGPT or similar that does not collect metadata, log IPs, or anything of that sort.
I do NOT want to host myself.
And not just an AI summary of the most relevant hits, like Brave’s search page does. I want the full experience, like discussing code-writing over multiple interactions.

There isn’t really a way of doing that privately, I think. DDG lets you disable JavaScript and use Tor without giving you 50 CAPTCHAs, so a version of that would have to exist for the AI service too. I’m not sure whether AI services have a problem with serving a lot of CAPTCHAs or not.

Probably what I’d look for is:

  • No CAPTCHAs
  • No accounts
  • Works without JS
  • Has a .onion site

Hmm, should be doable? The AI model runs on server 1, and I interact with server 2, which contains just a UI that proxies to server 1? So even if the AI model’s side needs to run some bad code, server 2 can run only privacy-respecting stuff. Or?
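Roughly, a minimal sketch of what I mean by the server 2 proxy (every name, port, and endpoint here is a hypothetical placeholder, not any real service):

```python
# Minimal sketch of the "server 2" idea: a thin UI/proxy that forwards only the
# prompt text to the model host ("server 1"), without logging or passing along
# client IPs, cookies, or headers. All URLs and routes are hypothetical.
from flask import Flask, request, jsonify
import requests

MODEL_SERVER_URL = "http://server1.internal:8000/generate"  # hypothetical model host

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    prompt = (request.get_json(silent=True) or {}).get("prompt", "")
    # Only the prompt text goes upstream; nothing about the client is forwarded,
    # and nothing is written to disk on this server.
    upstream = requests.post(MODEL_SERVER_URL, json={"prompt": prompt}, timeout=120)
    return jsonify({"reply": upstream.json().get("reply", "")})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Of course, whoever runs server 2 still sees my traffic, so the trust just moves there rather than disappearing.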

For maximum privacy, you can’t beat locally hosting an open-source model on your own hardware. With that said,

What about… DuckDuckGo :slight_smile: (duck.ai)? It seems like nothing could be more DuckDuckGo-like than the Duck itself.

It would meet all of Fria’s criteria:

Probably what I’d look for is:

No [excessive] CAPTCHAs
No accounts
Works without JS
Has a .onion site

You are still placing trust in DDG (as you are when you use DDG search), as well as some limited trust in the model hosting providers adhering to DDG’s terms.

If not Duck.ai maybe something like:

Groq

We do not retain your data

Prompts and context provided via our LLM APIs are not retained.

We do not train on your data

We specialize in fast inference.

Custom models are your own

We leverage best-practice security principles to ensure that we protect models that we host.

or Huggingchat

We endorse Privacy by Design. As such, your conversations are private to you and will not be shared with anyone, including model authors, for any purpose, including for research or model training purposes.

Your conversation data will only be stored to let you access past conversations. You can click on the Delete icon to delete any past conversation at any moment.


I enjoy duck.ai, but I would caution that it can only provide information based on the data it was trained on, which goes up to October 2021. That can lead to some outdated responses.

Not true. It depends on the model you use.

https://search.brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/

Can you provide a source? I don’t know of any other duck models that use a different dataset.

DuckDuckGo doesn’t make their own models, they use models from Meta (Llama), Mistral, OpenAI, etc.

Source: their terms and conditions, their privacy policy, and their product page.

Yes you can use different models but they still use the same duck dataset from my understanding.

EDIT:

Looks like Llama goes to December 2023 and o3-mini goes to October 2023; the issue I mentioned originally still stands, though.

Duck.AI uses 5 different models from 4 different companies (Mistral, Meta, OpenAI, Anthropic)

Yes you can use different models but they still use the same duck dataset from my understanding.

What is a “duck dataset”? DuckDuckGo doesn’t train these models AFAIK; they are using models trained by others.

This is my lack of understanding of the terminology. But if you ask any of the models, they will say something akin to:

DuckDuckGo’s AI model primarily utilizes OpenAI’s language models, which are trained on a diverse dataset up to October 2023.

As @Encounter5729 correctly pointed out, the cutoff date changes depending on the model, but that does not change the fact that answers can be 2-4 years out of date.

Just a heads up, it is against the forum rules to present information you got from AI without making a clear disclaimer (and even then still discouraged). Please try to only post your own knowledge or opinions.

In this case, that information is leading you to a wrong understanding (that DuckDuckGo is involved in training the models, or that there is a “duck dataset”). The training cutoff dates you are seeing are the cutoff dates of the models themselves, i.e. when the model makers (OpenAI, Anthropic, Meta, Mistral) stopped training them; they are not a DuckDuckGo-specific limit.

You are right, though, that models that are not given access to the web or another way to source more current info will always have somewhat outdated info for knowledge questions.


My bad, this was a bit of an odd situation, because in this particular case I assumed that duck.ai would be factual with this answer and would be the best source of information on itself.

Would it be fair to say I came to the right conclusion but used the wrong results?


EDIT:

I didn’t mean to veer this off topic or create misinformation. I only brought it up because I thought it was relevant to OP’s need: models that do not have access to the web, and may be outdated in their knowledge, could be an issue if you want help with code-writing or other topics that change quickly over time.


Would it be fair to say I came to the right conclusion but used the wrong results?

Mostly yes, I think it would be fair to say your broader impression (that training cutoff dates = less current knowledge) is correct, but some of your specific conclusions/assumptions were incorrect.

  1. You are correct that models have training cutoff dates, beyond which they cannot know newer info unless they are given access to the web or some other source of current information. That means models that are not given access to that real-time knowledge will always be working off of outdated info.
  2. You were incorrect in the assumption that this has anything to do with DuckDuckGo, or a “duck dataset”. It is the model makers (e.g. Mistral or OpenAI) that train the models. DuckDuckGo’s role is acting as an intermediary and providing the frontend that lets you interact with these models.

As a simple practical takeaway, the part relevant to OP’s question is that these models do not have access to real-time data and cannot reliably answer queries about anything past their training cutoff date; that applies to any model without access to real-time data.

Semi off topic, but important for effective/responsible use of AI

My bad, this was a bit of an odd situation, because in this particular case I assumed that duck.ai would be factual with this answer and would be the best source of information on itself.

That’s an assumption that is reasonable on its face but is actually a very bad one. Paradoxically, one of the areas where models are least reliable is self-knowledge (it’s not uncommon for models to not even know what model they are; many models think they are either GPT-3 or Llama). You can test this with Duck.ai right now: Mistral Small “believes” “itself” to be a version of Mistral 8x7b, GPT-4o “believes” “itself” to be GPT-3, and Llama 3.3 reports that it is “a model with no specific name”. I think they are often more accurate with respect to their training cutoff date, but I could be wrong, and I didn’t verify the accuracy of what the models reported.

Also, just as a generalization, never take factual statements from LLMs at face value. They are often right, but they sound just as confident and authoritative when they are dead wrong or hallucinating. This is why it’s important not to repeat an AI output without a disclaimer that it’s an AI output you haven’t personally verified, and not to base your understanding of something important to you or others on AI alone.

Personally, I apply the same approach and skepticism that I’d apply towards confident-sounding redditors. I learn a lot from confident-sounding redditors; much of what they confidently proclaim is correct or insightful, but some of it is wrong, and in both cases it’s stated confidently and authoritatively in a way that makes it “sound right.” AI is much the same. I think it’s prudent to treat AI outputs or Reddit comments as “assertions/claims” rather than “facts”, and to make clear, if you repeat them, who or what the source is.

In this case, the models don’t appear to be to blame; at least in my testing they don’t claim to be trained by DuckDuckGo, so I think you may have just assumed or misread that part. They state cutoff dates but do not mention DDG.

TL;DR don’t trust AI with “self-knowledge” in particular, and be cautious/skeptical of AI answers to general knowledge queries in general, as AI doesn’t really have a concept of factuality.


Some of the locally hosted frontends do allow giving any model of your choice access to search/web content. Open WebUI is one example. While most people using Open WebUI are self-hosting the model as well, that isn’t mandatory; Open WebUI can also be used as a frontend for whatever API you want to connect it to, as long as it’s a compatible API (see the sketch below).
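To illustrate what “compatible API” means here: most of these frontends speak the OpenAI-style chat API, so the same client code works against any such endpoint just by swapping the base URL. A minimal sketch, where the URL and model name are hypothetical placeholders rather than a recommendation:

```python
# Sketch of calling an OpenAI-compatible endpoint with the openai client library.
# base_url and model are hypothetical placeholders; point them at whichever
# provider (or self-hosted server) you actually trust.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-provider.example/v1",  # any OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="example-model",  # whatever model the provider exposes
    messages=[{"role": "user", "content": "Explain this Python traceback to me."}],
)
print(response.choices[0].message.content)
```

A frontend like Open WebUI is doing essentially this under the hood when you point it at a remote API instead of a local model.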


I appreciate the thought-out response. I don’t typically use AI without saying so and regret doing that in this instance. I am glad you pointed it out and corrected me.


Thank you all! It seems like I did not do proper research before asking, but hopefully this thread can be helpful for others as well. Both Brave and Duck had the exact thing I was looking for; I just did not know it was interactive, I thought the AI part was a summary of the top results and nothing more. And personally I don’t need total anonymity, I just don’t like logging of queries and IP addresses; if I should need that, I would just use Tor Browser.

Just what I was looking for! Thanks!


I’m using Brave daily and still missed that it had this function. Last time I checked it was just an AI summary, and I figured it still was, but now it has Copilot-like functions, so this is great news :slightly_smiling_face:


If you will be using this for coding, one thing to keep in mind is that the free + private options (and even the free + non-private options) will be limited in some ways. The limits aren’t generally a huge deal for general knowledge queries or short convos, but they might be an issue for large coding projects or analyzing large text documents.

Brave, for example, has some pretty strict rate limits on how many queries you can ask before you must subscribe to a paid plan. DDG doesn’t limit you in that way, but it has a max prompt length that may be a limitation for coding or processing large blocks of text (the limit is 16K characters, IIRC).
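If you do hit that kind of character cap, the usual workaround is to split your input before sending it. A rough sketch (the 16,000-character figure is just the limit as I recall it, so treat it as an assumption and check the service’s current docs):

```python
# Naive chunker for staying under a per-prompt character limit.
# MAX_PROMPT_CHARS is an assumed value, not an official number.
MAX_PROMPT_CHARS = 16_000

def chunk_text(text: str, limit: int = MAX_PROMPT_CHARS) -> list[str]:
    """Split text into pieces no longer than `limit` characters, preferring line boundaries."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # If adding this line would exceed the limit, flush what we have so far.
        if current and len(current) + len(line) > limit:
            chunks.append(current)
            current = ""
        # Hard-split any single line that is itself longer than the limit.
        while len(line) > limit:
            chunks.append(line[:limit])
            line = line[limit:]
        current += line
    if current:
        chunks.append(current)
    return chunks
```

You would still need to give the model enough context in each chunk for the answers to be useful, so this is more of a stopgap than a real fix for big coding projects.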


Last thing I'll say on the reliability of models' self-knowledge

Paradoxically, one of the areas where models are least reliable is self-knowledge (it’s not uncommon for models to not even know what model they are; many models think they are either GPT-3 or Llama).

@anonymous326 To really drive that point home, read through the short exchange below, and it should become clear just how unreliable a model’s “self”-awareness is.

Over just a short exchange, the model confidently gives 3 different cutoff dates, and only arrives at the correct one after I led it to the conclusion that it is in fact based on GPT-4, not GPT-3 as it initially believed. I asked a control question at the end (about an event that happened after the GPT-3.5 cutoff but before the GPT-4 cutoff), and it answered correctly. Despite all that, it reverted to believing its cutoff was in September 2021, despite having just answered a question about an event in April 2022.

TL;DR: Despite the model initially believing its knowledge cutoff is 2021, it was able to accurately answer a question about a current event which occurred in 2022, but first the model had to be led into understanding that it wasn’t GPT-3 and its cutoff date couldn’t be 2021:
