Is KeePassXC still trustworthy if they are allowing AI generated contributions?

I’m a KeePassXC maintainer and I’ve been following this thread for a while. Thanks for keeping this a mostly grounded discussion (I wouldn’t post here otherwise). I don’t want to add very much. Most arguments have been exchanged, the sides are clear, and we covered a lot in our blog post.

There’s one issue, however, that keeps coming up and that I feel needs a bit more context. That issue is the sycophantic nature of LLMs as a particular danger. We touched on it in the blog post, but there’s obviously a lot more to it. This is not to convince anyone who has made up their mind, but others might find it useful for contextualising the argument.

First of all, I find the sycophancy of LLMs annoying, but I’m not super worried about it for myself. I am worried about people who don’t know that LLMs do this. But if you know about it and expect it, it’s a lot easier to detect and deal with, to the point where it’s merely very annoying, precisely because you notice it.

But besides this subjective assessment, the more important point I want to make is that you should distinguish between different kinds of LLMs. They’re not all the same. So where does the sycophancy come from? For all we know, it’s mostly from two sources:

  1. The post-training process (“alignment”). Without post-training, LLMs are kind of raw and there isn’t much of a difference between code and text LLMs. There also isn’t anything particularly sycophantic about raw LLMs; they just blab the next likely word in the English language. To make them somewhat useful, pre-trained LLMs are fine-tuned to specific domains on a small, high-quality dataset, or some sort of reinforcement learning is involved to make them “speak” and follow instructions. The most famous method for this is Reinforcement Learning from Human Feedback (RLHF), in which a reward model is trained on ranked preferences given by humans (see the sketch after this list). This makes LLMs “understand” what the user expects as a response to a prompt, but it also favours “nice-looking” and perhaps sycophantic answers.
  2. The system prompt (what LLM vendors put before your prompt). Vendors tell fully trained LLMs what to do, what not to do, and generally how to behave in an invisible prompt that is added before yours (sometimes separated by special tokens you as a user cannot input). For instance, ChatGPT has the utterly annoying habit of acknowledging every follow-up prompt with a statement about how great and to-the-point the question was. I’m pretty sure that comes mostly from a system prompt telling it to do that.
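
To make point 1 a bit more concrete, here’s a rough sketch (PyTorch-style Python, not any vendor’s actual training code; `reward_model` and the variable names are made up for illustration) of how a reward model is typically fitted on ranked human preferences:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, answer_chosen, answer_rejected):
    """Bradley-Terry style pairwise loss used in typical RLHF reward modelling.

    `reward_model` is assumed to map (prompt, answer) to a scalar score.
    The loss pushes the score of the human-preferred answer above the score
    of the rejected one -- it only ever sees which answer the human liked
    better, not whether the answer is actually correct.
    """
    r_chosen = reward_model(prompt, answer_chosen)      # scalar tensor
    r_rejected = reward_model(prompt, answer_rejected)  # scalar tensor
    # -log sigmoid(r_chosen - r_rejected): minimised when the preferred
    # answer gets a clearly higher reward than the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The policy model is then tuned (e.g. with PPO) to maximise that learned reward. Nothing in this objective knows about correctness; it only knows what humans tended to prefer, which is exactly where the bias towards nice-looking, agreeable answers creeps in.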

The final piece is your prompt, and you actually have quite a bit of control here. Not everything in the system prompt can be overridden, but you can tell ChatGPT not to be so annoying, and that will at least tone it down a lot. You also have some control over the RL alignment. After all, LLMs are conditional language models: they generate the most likely answer given all of the input, and if you prompt them accordingly, you will influence their output distribution. Also, most models will not actively try to sabotage you unless they’re explicitly aligned to do that (they might still do so accidentally, through misalignment or plain incompetence).
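
If “conditional” sounds abstract, here’s a heavily simplified sketch of what inference boils down to. The `model` and `tokenizer` objects are hypothetical stand-ins (real stacks use chat templates and special tokens), but the point is that the vendor’s system prompt and your prompt end up in one token sequence, and every token in it shifts the output distribution:

```python
import torch

def generate(model, tokenizer, system_prompt, user_prompt,
             max_new_tokens=256, temperature=0.7):
    # The vendor's system prompt is simply prepended to yours; the model
    # sees one long token sequence and conditions on all of it.
    context = tokenizer.encode(system_prompt + "\n" + user_prompt)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([context]))[0, -1]        # next-token logits
        probs = torch.softmax(logits / temperature, dim=-1)   # conditional distribution
        next_token = torch.multinomial(probs, num_samples=1).item()
        if next_token == tokenizer.eos_token_id:
            break
        context.append(next_token)
    return tokenizer.decode(context)
```

This is why instructions like “be terse, don’t flatter me, point out mistakes” in your own prompt genuinely change the output, even though you can’t remove what the alignment baked in.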

But as I said above, not all LLMs are the same (even if they use the same pre-training steps or data), because they can use very different alignment methods. There are many more options to choose from than just RLHF. As far as I know, RLHF is (still) used extensively by OpenAI, and a bit less so by other vendors. Other examples are Direct Preference Optimization (DPO), which cuts out the separate reward model for more stable training and lower cost, or Reinforcement Learning from AI Feedback (RLAIF), which doesn’t even use human feedback any more. Another option is Reinforcement Learning with Verifiable Rewards (RLVR), where the rewards are directly grounded in a verifiable ground truth (calculation results, unit tests, etc.) and not in how nice an answer looks to a human. There’s nothing sycophantic about that. RLVR is particularly interesting for math, code, and logical reasoning models, and it was used extensively by DeepSeek.
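
For comparison, this is roughly what the DPO objective looks like (a simplified sketch, not any particular implementation; the inputs are assumed to be summed token log-probabilities of each answer under the policy being trained and under a frozen reference model):

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Direct Preference Optimization: same preference data as RLHF,
    but no separate reward model -- the policy is optimised directly.

    `beta` controls how far the policy may drift from the reference model.
    """
    policy_margin = logp_chosen - logp_rejected
    reference_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()
```

Note that DPO still learns from human preference rankings, so it isn’t immune to the “looks nice” bias either; RLVR is the one that swaps that signal for something you can actually verify.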

We don’t really know what OpenAI are doing behind closed doors, unfortunately, but almost all code-generation models today use some sort of objective reward signal in their post-training process and in the performance evaluation afterwards, usually in the form of unit tests and build success status for agentic models. That is also roughly what they claim to have done for GPT-5 Codex. None of that guarantees correct results (it’s still just statistics), but it makes a big difference whether you optimise for looks or for objective outcomes.
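
As an illustration of what a “verifiable” reward can look like in an agentic coding setup (purely a sketch under my own assumptions: `workdir` is a checkout that already has the model-generated patch applied, and the project’s tests run via pytest):

```python
import subprocess

def verifiable_reward(workdir: str) -> float:
    """RLVR-style reward: run the project's unit tests (or the build)
    against the model's output and reward only the objective outcome.

    There is nothing to flatter here -- the tests either pass or they don't.
    """
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],   # or a build command, linters, etc.
        cwd=workdir,
        capture_output=True,
        timeout=600,
    )
    return 1.0 if result.returncode == 0 else 0.0
```

No human ranking enters that loop, which is why I expect far less sycophancy from models post-trained this way. It’s still no guarantee of correctness, of course; the reward is only as good as the tests.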

So no, I’m not too concerned about LLMs producing masterfully deceptive code that is near-impossible for humans to detect: 1) because I know the limitations and habits of the models, and 2) because that’s not really what they do.

I won’t engage further in the pro and con discussion here. But I’ll keep reading a bit and may answer a technical question if one arises.
