Do you trust benchmarks like LMArena and Artificial Analysis, and how should one use them as a guide? How do you decide which AI to use for a given task, and does your choice differ across tasks? Do you read posts or watch expert videos? I would appreciate any recommendations or resources.
Those benchmarks don’t have anything to do with privacy, as far as I can tell, so I wouldn’t pay any attention to them. Please correct me if I’m wrong.
If I use an AI at work, I have to use one through a browser because I’m not allowed to install Ollama. So I don’t think about it too much there.
But at home I use the Ollama app because it runs models locally, which means more privacy.
As for choosing particular models, I’m still trying out different ones. You can see how many are available here: Ollama Search
So far I can’t say that any of the ones I’ve tried is a clear winner. All AIs are fairly stupid in their own way. But I’ll keep testing them one by one, and maybe I’ll find one that I stick with. At some point I’ll probably post my results here on the forum.
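Since this is basically a model-comparison workflow, here is a minimal sketch of how you could automate it: it sends the same prompt to a few local models through Ollama’s REST API (assuming Ollama is running on its default port 11434) so you can eyeball the answers side by side. The model names are placeholders for whatever you have pulled locally.

```python
# Minimal sketch: send one prompt to several local Ollama models and
# print the answers for a quick side-by-side comparison.
# Assumes the Ollama server is running on localhost:11434 and the
# models below have already been pulled (swap in your own).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODELS = ["llama3.2", "mistral", "qwen2.5"]  # placeholder model names
PROMPT = "Explain the difference between a process and a thread in two sentences."

def ask(model: str, prompt: str) -> str:
    """Send a single non-streaming generate request to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    for model in MODELS:
        print(f"=== {model} ===")
        print(ask(model, PROMPT).strip(), "\n")
```

Everything stays on your own machine, so the privacy advantage of running locally is preserved while you test models one by one.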
LMArena has essentially been gamed. I look at benchmarks that measure the capabilities I care about, mainly math and code. AiderChat and MathArena are two good benchmarks.
Wdym?
I think they are probably referring to the plausibility that model makers are training models (or specific versions of models) to score highly on LMArena (essentially “teaching to the test”, and gaming human nature via flattery/sycophancy).
Hard Fork (podcast) touched on this a few weeks ago. The relevant segment starts at the 50min mark.
Also, a study has shown that providers can test many private models and publish only the best one, that closed-source models are served more often than open-source/open-weight models, that open-source/open-weight models are retired earlier than their closed-source counterparts, and so on.
All of this makes the leaderboard unreliable.
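To make the “test many private models, publish only the best” point concrete, here is a toy Monte Carlo sketch (my own illustration, not taken from the study): even if every private variant is identical in quality, publishing only the top scorer inflates the reported win rate purely through selection.

```python
# Toy illustration of selection bias: a provider evaluates several private
# variants on the arena and publishes only the highest-scoring one.
# Even when all variants are equally good (true win rate 50%), the
# published number ends up above 50% on average.
import random
import statistics

random.seed(0)

TRUE_WIN_RATE = 0.50   # every variant is actually a coin flip against the field
BATTLES = 300          # arena battles each private variant gets
VARIANTS = 10          # private variants tested before picking one to publish
TRIALS = 2000          # Monte Carlo repetitions

def observed_win_rate() -> float:
    """Empirical win rate of one variant over a finite number of battles."""
    wins = sum(random.random() < TRUE_WIN_RATE for _ in range(BATTLES))
    return wins / BATTLES

honest = [observed_win_rate() for _ in range(TRIALS)]
published = [max(observed_win_rate() for _ in range(VARIANTS)) for _ in range(TRIALS)]

print(f"honest single-submission mean win rate: {statistics.mean(honest):.3f}")
print(f"best-of-{VARIANTS} published mean win rate:     {statistics.mean(published):.3f}")
```

The gap between the two printed numbers is pure statistical noise being harvested by cherry-picking, which is one reason the leaderboard positions can overstate a model’s real quality.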