Do you trust benchmarks like LMArena and Artificial Analysis, and how should one use them as a guide? How do you decide which AI to use for a given task, and does your choice differ across tasks? Do you read posts or watch expert videos? I would appreciate any recommendations or resources.
Those benchmarks don’t have anything to do with privacy, as far as I can tell, so I wouldn’t pay any attention to them. Please correct me if I’m wrong.
If I use an AI at work, I have to use one through a browser because I’m not allowed to install Ollama. So I don’t think about it too much there.
But at home I use the Ollama app because it runs models locally, which means more privacy.
As for choosing particular models, I’m still trying out different ones. You can see how many are available here: Ollama Search
So far I can’t say that any of the ones I’ve tried is a clear winner. All AIs are fairly stupid in their own way. But I’ll keep testing them one by one, and maybe I’ll find one that I stick with. At some point I’ll probably post my results here on the forum.
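Since this is basically a model-comparison workflow, here is a minimal sketch of how you could automate it: it sends the same prompt to a few local models through Ollama’s REST API (assuming Ollama is running on its default port 11434) so you can eyeball the answers side by side. The model names are placeholders for whatever you have pulled locally.

```python
# Minimal sketch: send one prompt to several local Ollama models and
# print the answers for a quick side-by-side comparison.
# Assumes the Ollama server is running on localhost:11434 and the
# models below have already been pulled (swap in your own).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODELS = ["llama3.2", "mistral", "qwen2.5"]  # placeholder model names
PROMPT = "Explain the difference between a process and a thread in two sentences."

def ask(model: str, prompt: str) -> str:
    """Send a single non-streaming generate request to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    for model in MODELS:
        print(f"=== {model} ===")
        print(ask(model, PROMPT).strip(), "\n")
```

Everything stays on your own machine, so the privacy advantage of running locally is preserved while you test models one by one.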
LMArena has essentially been gamed. I look at benchmarks that measure the capabilities I care about, mainly math and code. AiderChat and MathArena are two good benchmarks.
Wdym?
I think they are probably referring to the plausibility that model makers are training models (or specific versions of models) to score highly on LMArena (essentially “teaching to the test”, and gaming human nature via flattery/sycophancy).
Hard Fork (podcast) touched on this a few weeks ago. The relevant segment starts at the 50min mark.
Also, a study has shown that providers can test many private models and publish only the best one, that closed-source models are served more often than open-source/open-weight models, that open-source/open-weight models are retired earlier than their closed-source counterparts, and so on.
All of this makes the leaderboard unreliable.
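To make the “test many private models, publish only the best” point concrete, here is a toy Monte Carlo sketch (my own illustration, not taken from the study): even if every private variant is identical in quality, publishing only the top scorer inflates the reported win rate purely through selection.

```python
# Toy illustration of selection bias: a provider evaluates several private
# variants on the arena and publishes only the highest-scoring one.
# Even when all variants are equally good (true win rate 50%), the
# published number ends up above 50% on average.
import random
import statistics

random.seed(0)

TRUE_WIN_RATE = 0.50   # every variant is actually a coin flip against the field
BATTLES = 300          # arena battles each private variant gets
VARIANTS = 10          # private variants tested before picking one to publish
TRIALS = 2000          # Monte Carlo repetitions

def observed_win_rate() -> float:
    """Empirical win rate of one variant over a finite number of battles."""
    wins = sum(random.random() < TRUE_WIN_RATE for _ in range(BATTLES))
    return wins / BATTLES

honest = [observed_win_rate() for _ in range(TRIALS)]
published = [max(observed_win_rate() for _ in range(VARIANTS)) for _ in range(TRIALS)]

print(f"honest single-submission mean win rate: {statistics.mean(honest):.3f}")
print(f"best-of-{VARIANTS} published mean win rate:     {statistics.mean(published):.3f}")
```

The gap between the two printed numbers is pure statistical noise being harvested by cherry-picking, which is one reason the leaderboard positions can overstate a model’s real quality.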