Is KeePassXC still trustworthy if they are allowing AI generated contributions?

I can be the bearer of bad news for Bitwarden. It's cooked as well.
See: AI | Bitwarden Contributing Documentation

There is an alternative called VaultWarden, but there is no non-slop-coded client available.

Thanks. My point in bringing that up was more “you need to provide sources for claims” than “I don’t think what is being said is true”. The entire argument being made has been entirely anecdotal and I expect better from this forum.

1 Like

Shocking!

It's 2025, bru. AI is like the invention of firearms. There is no going back.

4 Likes

Personally I don’t believe AI-generated code is less safe in open source than accepting PRs from total strangers, which is obviously well established practice. I don’t like AI for other reasons, but security is not necessarily one of them and this development practice would not make me stop using/recommending KeePassXC at this time.

Why not?

13 Likes

I second this. The amount of very obvious vulnerabilities and other weaknesses I can still spot in generated code is rather high.

Why not?

As I have outlined above, AI code has no underlying understanding, mental model, or any reference to ground truth. With normal code you can reason through it and try to think about the programmer's intent. You can't do that with AI code, because there never was any intent or reasoning; it is just the output of a statistical model, something that looks correct and might, by complete happenstance, actually be what you wanted.
This lack of understanding is the reason why I consider AI code a glaring security problem.

1 Like

As @jonah says, since PRs can be submitted by total strangers, I don't see an increased security problem in accepting AI code if all PRs are properly reviewed. KeePassXC seems firm about its code review process. The only problem I can foresee is that the code review process fails.

If people are really unhappy about AI code making its way into software projects for security reasons or other reasons, forking the project might be a possibility. KeePassXC could be forked.

I wonder what other security-critical FOSS projects accept AI code. Anyone have information about BSD, Debian, VPN clients, Signal, Tor, Tuta, Proton, cryptocurrency wallets, etc?

@Fermata Your concern seems to be that since AI code has “no underlying understanding, mental model or any reference to ground truth” it cannot be trusted, even if all PRs go through KeePassXC’s code review process. I would think that part of the purpose of the code review process is to check that the code reflects this understanding, but can you explain (with some examples or other details) how that is not true? Which part(s) of KeePassXC’s blog post convinced you to stop using KeePassXC?

1 Like

Largely as Fermata put it: we have much more experience trying to reason about the actions of other people than we have trying to reason about AI output. AI output is typically sycophantic, and there’s a huge difference between trying to spot mistakes in the code of someone with no experience (very new contributors), someone with some experience (“typical” contributors), someone with a lot of experience (long-time contributors, maintainers), and code generated by a statistical model trained on decades of code of all kinds, which has a tendency to produce output that “looks” as correct as possible. “Looking” correct, being correct, and being someone’s “best attempt” at correct are different things. With experience, you can spot mistakes based on misunderstandings or gaps in knowledge which you’ve seen or made in the past, likely because of similar reasoning. No such reasoning exists in AI. Furthermore, when human contributors are wrong, they can be taught to be better. It won’t necessarily happen, but it can, because they have some understanding of the things they do. People learn.

To be clear, if you want to keep recommending KeePassXC, that’s fine by me. I’m not here to argue otherwise. In a world with infinite maintainer time and a strong review process, it very well may be as secure as code from typical contributors. I disagree with the notion that it is “just a tool” or that it can be used ethically; that is where my disagreement lies. The harms which have made themselves apparent to me are too great to think otherwise. I’ve seen firsthand, in both my colleagues and my family, how AI destroys people’s ability to think for themselves. I also think writing code is an art form, and that we should write code “by people for people”.

2 Likes

The act of writing code is but one step of the entire software development lifecycle, and having a degree of semi-automation in one step actually highlights the strengths of the other processes: review, testing, project management, and technical vision. If the only reason a FOSS project is good is one or two really good external developers making up for a weak project lead, it’s a project on shaky ground even before AI usage. A strong FOSS maintainer will reject PRs that don’t meet a quality standard, regardless of how the code was written.

4 Likes

To @Fermata, I clearly understood that I cannot win you over and that you made up your mind a long time ago. You marked this question as solved, hence I’m done arguing. :white_check_mark:

I didn’t mean to have a condescending tone; English is far from being my strongest language, so sorry if my tone wasn’t kind enough.
We agree on quite a few points anyway; it’s just the way to get there, or the fact that there is no clear path to victory for a healthy ecosystem where quality code is valued enough and paid for.


To @hashcatHitman, I don’t have a command of fancy words or a grasp of all the fallacies listed above.
At the same time, it is not worth my time to dig deep into some drama from companies or even care about how bad things are nowadays; I know enough and don’t want to list my sources as if I’m writing a thesis.
Maybe it’s just my own feed/bubble, but I know where I stand and I’m fine with it. :+1:t2:
I’d rather use my time, money and energy on spreading some positivity on a scale beyond 2 users on a forum.

I won’t benefit from “winning” here either; you’re a random anonymous person on the Internet that I will probably never meet in person. I also tend to opt out of AI/software politics; it’s not a war worth fighting, at all.

I do also agree with you on the state of AI in the software industry anyway; I just don’t want to bikeshed over semantics.
Wishing you a nice day ahead. :blush:


I mostly fell into my usual trap of being a bit too personally invested in a topic that I do care about: quality software by a skilled small team.
I do like KeePassXC a lot and felt righteous trying to defend it, I guess. But @jonah is not removing this from the recommendations, so I have nothing to worry about.

Even if a tool is not recommended here, I’d willingly make my own judgement and use it myself (an example could be Immich :+1:t2:).
I finally understood what good quality software means after a few years.

6 Likes

In any case, people can submit pull requests with code written by AI without saying that the code was written by AI. You can’t go back to 2020 when code wasn’t written by AI. What matters is how the code is reviewed and tested, not who wrote it.

3 Likes

I’m a KeePassXC maintainer and I’ve been following this thread for a while. Thanks for keeping this a mostly grounded discussion (I wouldn’t post here otherwise). I don’t want to add very much. Most arguments have been exchanged, the sides are clear, and we covered a lot in our blog post.

There’s one issue, however, that keeps coming up and that I feel needs a bit more context. That issue is the sycophantic nature of LLMs as a particular danger. We touched on it in the blog post, but there’s obviously a lot more to that. This is not to convince anyone who has made up their mind, but others might find it useful to contextualise the argument.

First of all, I find the sycophancy of LLMs annoying, but I’m not super worried about it for myself. I am worried about people who don’t know that LLMs do that. But if you know and expect it, it’s a lot easier to detect and to handle, to the point where you just find it very annoying, because you notice it.

But besides this subjective assessment, the more important point I want to make is that you should distinguish between different kinds of LLMs. They’re not all the same. So where does the sycophancy come from? For all we know, it’s mostly from two sources:

  1. The post-training process (“alignment”). Without post-training, LLMs are kind of raw and there isn’t much of a difference between code and text LLMs. There also isn’t anything particularly sycophantic about raw LLMs. They just blab the next likely word in the English language. To make them somewhat useful, pre-trained LLMs are fine-tuned on a small high-quality dataset for specific domains, or there’s some sort of reinforcement learning involved to make them “speak” and follow instructions. The most famous method for this is Reinforcement Learning from Human Feedback (RLHF), in which a reward model is trained on ranked preferences given by humans (see the sketch below this list). This makes LLMs “understand” what the user expects as a response to a prompt. But it also favours “nice looking” and perhaps sycophantic answers.
  2. The system prompt (what LLM vendors put before your prompt). Vendors tell fully-trained LLMs what to do, what not to do, and generally how to behave in an invisible prompt that is added before yours (sometimes separated by special tokens you as a user cannot input). For instance, ChatGPT has the utterly annoying habit of acknowledging every follow-up prompt with a statement about how great and to the point that question was. I’m pretty sure that comes mostly from a system prompt telling it to do that.
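
As a rough illustration of point 1 (my own toy sketch, not any vendor’s actual pipeline): the heart of RLHF reward modelling is a pairwise preference loss that pushes the reward model to score the human-preferred answer above the rejected one. In Python, with made-up reward scores standing in for real model outputs:

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss used in RLHF reward modelling:
    -log(sigmoid(r_chosen - r_rejected)). The loss shrinks as the
    reward model rates the human-preferred answer above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# Hypothetical scores for two answers to the same prompt. If human rankers
# consistently prefer polished, agreeable answers, this is the objective the
# reward model (and ultimately the LLM) gets optimised towards.
print(pairwise_preference_loss(reward_chosen=2.1, reward_rejected=0.3))  # small loss
print(pairwise_preference_loss(reward_chosen=0.3, reward_rejected=2.1))  # large loss
```

The important bit is that the training signal is “which answer did a human prefer”, and that’s exactly where a bias towards nice-looking answers can creep in.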

The final piece is your prompt, and you actually have quite a bit of control here. Not everything in the system prompt can be overridden, but you can tell ChatGPT not to be so annoying, and that will at least tone it down a lot. You also have some control over the RL alignment. After all, LLMs are conditional language models. They will generate the most likely next answer given all the input, and if you prompt them accordingly, you will influence their output distribution. Also, most models will not actively try to sabotage you, unless they’re explicitly aligned to do that (they might do it accidentally sometimes, either through misalignment or incompetence).

But as I said above, not all LLMs are the same (even if they use the same pre-training steps or data), and that is because they can use very different alignment methods. There are many more options to choose from than just RLHF. As far as I know, RLHF is (still) used extensively by OpenAI, and a bit less so by other vendors. Other examples are Direct Preference Optimization (DPO), which cuts out the reward model for more stable results and faster runtime, or Reinforcement Learning from AI Feedback (RLAIF), which doesn’t even use human feedback any more. Another option is Reinforcement Learning with Verifiable Rewards (RLVR), where the rewards are directly grounded in a verifiable ground truth (calculation results, unit tests, etc.) and not in how nice an answer looks to a human. There’s nothing sycophantic about that. RLVR is particularly interesting for math, code, and logical reasoning models, and it was used extensively by DeepSeek.
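
To make the RLVR contrast concrete, here’s a toy sketch of my own (not any lab’s actual training code): the reward is whatever the test harness says, grounded in execution results rather than in how nice the answer reads. The `solve` entry point and the test cases are made up for illustration:

```python
def verifiable_reward(candidate_src: str, tests) -> float:
    """Toy RLVR-style reward: 1.0 if the generated function passes every
    test case, 0.0 otherwise. The signal comes from running the code,
    not from a human's preference for how the answer looks."""
    namespace = {}
    try:
        exec(candidate_src, namespace)      # define the candidate function
        solve = namespace["solve"]          # assumed entry point name
        return 1.0 if all(solve(x) == expected for x, expected in tests) else 0.0
    except Exception:
        return 0.0                          # code that crashes earns no reward

# Hypothetical model output for "return the square of x", plus test cases.
candidate = "def solve(x):\n    return x * x\n"
print(verifiable_reward(candidate, tests=[(2, 4), (3, 9)]))  # 1.0
```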

We don’t really know what OpenAI are doing behind closed doors, unfortunately, but almost all code generation models today use some sort of objective reward signal in their post-training process and the performance evaluation afterwards; usually in the form of unit tests and build success status for agentic models. This is also roughly what they claim to have done for GPT-5 Codex. None of that will guarantee correct results (it’s still just statistics), but it makes a big difference whether you optimise for looks or for objective results.

So no, I’m not too concerned about LLMs producing masterfully deceptive code that is near-impossible for humans to detect: 1) because I know the limitations and habits of the models, and 2) because it’s not really what they do.

I won’t engage further in the pro and con discussion here. But I’ll keep reading a bit and may answer a technical question if one arises.

10 Likes

@kissu:
I don’t expect you to write a thesis, but it is important to me for conversations on this forum to be clearly composed of evidence-based claims or clearly stated opinions. If something is written in a way that appears to me to be asserting some claim (rather than stating an opinion), I expect at least one source. It doesn’t need to be a research paper. I just want to know what you’re basing your claims off of, so at the very least, I (and others) have something I can use to consider for myself whether to take them as true or not. If you’re making a claim that is false, knowing where you got it from can help clear up misunderstandings as well (if you cite The Onion, you either aren’t taking the conversation seriously, or you don’t realize it’s satire, for example).

It might be the case that you weren’t trying to make an assertion of a claim and I misunderstood you due to a language barrier. Or I misunderstood for entirely different reasons. Or I didn’t misunderstand. Whatever the case, I don’t care about “winning” here either. If you’re content with where you stand as a matter of opinion, that isn’t any of my concern. Though I do think it’s important to try not to grow content in our own bubbles; that sort of apathy allows circumstances to change for the worse slowly beneath our noses, and I’d argue it plays a big part in the privacy issues we face today.

Nevertheless, thanks for keeping it respectful. I also wish you a nice day (this feels like an awkward way to respond to that, but I think “good day to you too” or something like that would come off as a weird kind of disrespectful???).

4 Likes

(I know this isn’t exactly directed at me, but I want to respond to it anyway)

As I said previously, I’m not arguing specifically with regard to KeePassXC. I hadn’t read the blog post until just now because it wasn’t particularly relevant to the point I was trying to make, independent of KeePassXC. I have now read it to better contextualize your post.

I do agree that the sycophancy issue is much less concerning if you’re conscious of it, so with respect to KeePassXC specifically, it is reassuring, in my opinion, to know that this isn’t news to you.

I was already aware of alignment, RLHF, RLAIF, and system prompts. DPO and RLVR are news to me though, so I appreciate you mentioning them. I’ll try to look into them more when I can, but if you know any good resources, it’d be appreciated. I don’t see any specific mention of “RLVR” in the link you provided, but I’m assuming that’s what “iteratively run tests until passing results are achieved” is supposed to mean?

Obviously, this doesn’t really change how I feel about this as a matter of ethics or the “by people for people” side of things. But I also recognize that “the genie is out of the bottle” and there is almost no chance of entirely removing AI from the picture. With that in mind, with respect to KeePassXC and the blog post, I do think that, assuming all claims about everything still being manually reviewed are true, KeePassXC is probably taking roughly the second-best approach I could ask for. I’m always going to be an advocate for “no AI”, but failing that, explicit approval with the requirement of disclosure is probably better than saying nothing at all. I’m also definitely more “okay with” using it to perform additional code reviews (not as a replacement for the normal ones) as the blog post mentions, since I think it’s a lot easier to identify when a complaint it makes is invalid (and thus, it’s more likely to be beneficial and help identify issues).

With respect to “creating pull requests that solve simple and focused issues, add boilerplate code and test cases”, though, I have to ask: where does the line get drawn for when a “simple issue” should be offloaded to AI? In my mind, there’s a particular risk for open source projects there, in the sense that I think having “easy wins” available is an important part of being able to bring in new contributors. It’s a way for them to get familiar with the process and push past doubts, and if these kinds of small issues aren’t in supply, I do think it could hurt the project in the long term (notably, KeePassXC has only ever had 30 “good first issues” as of writing this, all of which have been resolved).

2 Likes

I don’t see any specific mention of “RLVR” in the link you provided, but I’m assuming that’s what “iteratively run tests until passing results are achieved” is supposed to mean?

No, OpenAI don’t mention anything about it; that’s kind of their M.O. (sadly). But the phrase reads like they did at least something similar to it. You can find several resources on the web. If you’re more into primary literature, then this would be it: https://arxiv.org/pdf/2411.15124

There’s also a further analysis about its efficacy here (also just a preprint, haven’t read it too deeply, so handle with some scepticism): https://arxiv.org/pdf/2506.14245. Apparently, RLVR is also equivalent to contrastive loss at least according to this: arxiv[.]org/pdf/2503.06639 (ugh, cannot post more than two links yet)

It’s not the one and only thing that everybody does (it’s hard to know what anybody does, really), but code models are benchmarked on datasets with verifiable solutions, so it makes sense to train them like that.

I have to ask: where does the line get drawn for when a “simple issue” should be offloaded to AI?

That’s as much a value statement as anything. You could similarly ask whether we should accept this or that pull request. When is the quality good enough? How much does it need to be changed to be passable? There’s no general answer to this. With regard to our own use, I would say that at some point the practical quality will just be too low and we’d be spending more time fixing the damn thing than if we’d written it ourselves. That would be the point where AI would be relegated to just adding opinions or suggestions, or not used at all. I personally use AI very rarely for code. But it’s a nice learning tool if you’re trying something new, and it helps with boilerplate tasks.

What other people will submit is outside our control; here we’ll just have to judge the contribution on its merits like any other PR. If it’s an enormous diff, we’d probably reject it even if it were of decent quality and fully human-written. Better to do things iteratively in that case. We also don’t easily change the critical core components without reason and particular scrutiny. That applies to both human and AI contributions.

Good point about the easy issues. We should pick out a few fresh ones.

2 Likes

First off, thanks a lot to the entire team for the hard work of making such a nice product and being honest with your users over the years. Wishing you all a long and sustainable future. :sparkling_heart:


I personally use AI very rarely for code. But it’s a nice learning tool if you’re trying something new, and it helps with boilerplate tasks.

Oh wow, I expected you to at least use it for:

  • single-line code autocompletion
  • automated code review (à la CodeRabbit)
  • helping with unit tests or lower-complexity tasks

It’s indeed very reassuring to have more details!
Thanks for joining the forum and being available here in case folks do have extra questions. :folded_hands:t2:

1 Like

Yes, I do have some auto-complete++ in my IDE. But that’s a small local model. It’s helpful, but it also gets a lot of stuff wrong, which annoys me a bit. Not at all comparable to actual agentic coding models.

We use Copilot for automatic reviews, but that’s nothing I use personally for other projects at the moment.

Unit tests are kind of what I meant by boilerplate code. But I still have to check that the tests actually test what they’re supposed to test, obviously.
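
To give a made-up example of what that check catches (hypothetical function and test, not from our code base): a generated test can run green every time while never asserting the behaviour it’s named after.

```python
import unittest

def normalize_entry_title(title: str) -> str:
    # Hypothetical function under test.
    return title.strip()

class TestNormalizeEntryTitle(unittest.TestCase):
    def test_strips_surrounding_whitespace(self):
        result = normalize_entry_title("  My Entry  ")
        # Looks plausible and always passes, but it never compares the
        # result to the expected "My Entry" -- it only checks that
        # *something* was returned.
        self.assertIsNotNone(result)

if __name__ == "__main__":
    unittest.main()
```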

1 Like

Thanks! These look like exactly the kind of papers I would want to find.

Yeah, I think it might have been a poor question. I was trying to get a feel for how much it might end up hurting the availability of the aforementioned “easy wins”.

Ultimately, this is the solution. As long as there are a good handful available to bring people in, it doesn’t matter too much if a few end up being solved by AI or existing contributors (though I’m definitely of the opinion that these issues should be reserved for new contributors when they’re of a nature that can be “put off” with little to no real consequence… at least for a decent amount of time).

Thanks for your response!

I would separate those a bit. Not every easy issue is a good first issue. Good first issues are usually new features that are nice to have, but non-essential. But there are also a lot of simple issues that are bugs that just need to be fixed, and for which we can’t wait until maybe someone comes by. Many contributors want to have certain bugs fixed that affect themselves. People who are just looking for something to do exist, but are rare.

There are also many issues that are very difficult to identify and that need deep knowledge of the code base to find, but the actual fix is very simple. Also, in many cases, it’s way faster to just write the fix than to prompt an AI to do it, because the issue is too simple. There needs to be a minimum level of complexity for it to be a worthwhile tool, but not so much that you spend more time correcting the errors than if you had done it yourself.

  1. As is the contribution of any developer.
  2. Even if a company claims that it does NOT use code generated by an LLM, you cannot verify this. Even if pull requests are sent by people, it could be code generated by an LLM. The KeePassXC code cannot be considered less trustworthy than the code of any other company, even one that denies using LLMs.

6 Likes