Forget the Benchmarks – A Real-World Local vs Cloud LLM Cage Match

Now that I have my dream local AI stack, it’s time for an LLM Showdown: Real World Edition! (Please note that if the above sentence didn’t sound like an over-pumped sports announcer in your head, we can’t be friends.)

I came up with three representative real world tasks and a way to compare them (somewhat quantitatively, somewhat qualitatively). Let’s see what happens when we pit medium-sized local models, large local models, open source Cloud models, and the two proprietary Cloud heavyweights in an extreme cage match. So throw out what you think you know based on the benchmarks, because stuff is about to get real!

The Contenders

| Model | Location | License | Size |
| --- | --- | --- | --- |
| Gemma 4 26B-A4B | Local | Open Source | Medium |
| Qwen 3.6 35B-A3B | Local | Open Source | Medium |
| Qwen 3.5 122B-A10B | Local | Open Source | Large |
| Kimi K2.6 | Cloud | Open Source | Huge |
| ChatGPT 5.4 | Cloud | Proprietary | Huge |
| Claude Sonnet 4.6 | Cloud | Proprietary | Huge |

TL;DR

  • Best model for coding
    • Overall: Claude Sonnet 4.6 (no surprise)
    • Local: Qwen 3.5 122B-A10B (for those with lots of patience) or Qwen 3.6 35B-A3B (for those who have my level of patience)
  • Best model for creative writing
    • Overall + Local: Gemma 4 26B-A4B (an upset victory for Team Local)
  • Best model for research
    • Overall: Kimi K2.6
    • Local: Qwen 3.6 35B-A3B (tied with Claude Sonnet 4.6 for second place overall)
  • Harness takeaways
    • Best for coding: Claude Code (despite my desire for Cursor to win)
    • Best for research: Hermes Agent
    • Best for creative writing: Open WebUI

All code, research, and writing examples are available on GitHub.


Task #1: Coding

I spend a significant portion of my workday coding, reviewing code, or defining specs for code. This is my life. And while I am indeed coding in a basement, it is my own basement and not my mom’s, thank you very much! And what do I do for fun when I’m not coding for work? Why, I code on personal projects, of course. That may mean I have a psychological problem, but I bet many of my readers feel seen.

For this task, I am going to test a light version of spec-driven development so I can give the models some guidance while also leaving them enough rope to hang themselves (figuratively, unless they get stuck, in which case they hang in the program-execution sense). I based this task on a fabulous article by Rittika Jindal on building a ReAct agent from scratch. Go read it! It’s OK, I’ll still be here when you come back. Unless you take too long, then I may take this article down.

Using this article with its reference code, I can now compare the generated projects with Rittika’s reference implementation. So I don’t have to rely on an LLM to judge the results in a vacuum – I have something to anchor the scores. Huzzah for sensible anchors.

Since the harness is just as important as the model these days, I tested each model inside Claude Code. But I did take a flyer on Cursor’s automatic model selection mode. I think auto mode usually defaulted to Composer 2 at the time the code was written, but I can’t guarantee exactly which model it was using. So Cursor is more of a harness comparison, not part of the model cage match showdown.

Initial Prompt:
Read the instructions in ARCHITECTURE.md and create the specified system. I want to use uv with Python 3.12

Metrics:

  • Does it run correctly?
  • Does it accurately represent the architecture document?
  • Does it include all requested demo scenarios?
  • Total time spent to reach the final product
  • Penalties for requiring manual intervention
  • Bonus points for nicely formatted output

Results:

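| Name | Quickness | Objectives | Manual Penalties | Formatting | Total |
| --- | --- | --- | --- | --- | --- |
| Gemma 4 26B-A4B | 0 | 5 | -5 | 0 | 0 |
| Qwen 3.6 35B-A3B | 3 | 5 | -2 | 2 | 8 |
| Cursor (auto) | 4 | 5 | -2 | 2 | 9 |
| Qwen 3.5 122B-A10B | 2 | 5 | 0 | 3 | 10 |
| Kimi K2.6 | 4 | 5 | -1 | 3 | 11 |
| Claude Sonnet 4.6 | 5 | 5 | 0 | 3 | 13 |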
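
For anyone who wants to re-weight my scores or add their own contender, here is a minimal sketch of how the totals in the table add up. The numbers are copied from the table; the class and function names (CodingScore, ranked) are made up for illustration and are not from the GitHub repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodingScore:
    """One row of the coding results table (names here are illustrative)."""
    name: str
    quickness: int
    objectives: int
    manual_penalty: int  # zero or negative, as in the table
    formatting: int

    @property
    def total(self) -> int:
        # The total is a plain sum of the four scored columns.
        return self.quickness + self.objectives + self.manual_penalty + self.formatting


# Values copied from the results table above.
SCORES = [
    CodingScore("Gemma 4 26B-A4B", 0, 5, -5, 0),
    CodingScore("Qwen 3.6 35B-A3B", 3, 5, -2, 2),
    CodingScore("Cursor (auto)", 4, 5, -2, 2),
    CodingScore("Qwen 3.5 122B-A10B", 2, 5, 0, 3),
    CodingScore("Kimi K2.6", 4, 5, -1, 3),
    CodingScore("Claude Sonnet 4.6", 5, 5, 0, 3),
]


def ranked(scores: list[CodingScore]) -> list[CodingScore]:
    """Return the rows sorted from best total to worst."""
    return sorted(scores, key=lambda s: s.total, reverse=True)


if __name__ == "__main__":
    for row in ranked(SCORES):
        print(f"{row.name:<20} {row.total:>3}")
```

Swap in your own weights inside total if, say, you care more about manual intervention than speed.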

Color Commentary:

Gemma 4 26B-A4B was like asking a puppy to perform surgery. I will never subject myself to that again (the coding that is – I’ll still take the puppy). In my last article I highlighted that it seemed useful on my simple coding requests, but now I’m dropping Gemma 4 for coding like it’s on an Interpol watch list. As you can see in the scores above, Gemma 4 required a LOT of manual intervention. It’s the only model that legitimately made foundational mistakes in its solution, along with weird folder nesting. Did I get it to succeed? Yes. Would I rather walk over hot coals than use Gemma 4 for another complicated coding project? Yes. You’ve been warned, and now go warn your friends.

The gap between Gemma 4’s “I hope I don’t remember this experience” and the next model’s experience is bigger than the Grand Canyon. Let’s just pretend Gemma 4 doesn’t exist for coding purposes.

Qwen 3.6 35B-A3B was perfectly usable. It has become my go-to local coding model because it is generally capable and my machine can run it at 50 tokens per second (as referenced above – I am impatient with coding models). In my scoring, it ranked just behind one of the most popular AI IDEs on the planet (Cursor). That’s impressive.

Since Cursor is only making an appearance to satisfy my curiosity, I’ll just point out that I was surprised by the gap between Cursor and Claude Code. I use Cursor all day every day, so I expected it to tie or slightly lose to Claude Sonnet 4.6 inside of Claude Code. I was wrong. All you Claude Code fanboys and fangirls can take a victory lap in the comments.

Now for the next surprise: Qwen 3.5 122B-A10B put some distance between itself and its newer-but-smaller cousin Qwen 3.6. The bigger Qwen 3.5 was the only model to formally initiate planning mode, it created a good plan, and the model executed on the plan with very little manual intervention. I even liked its output formatting better than Qwen 3.6 and Cursor. The only downside is the speed. On my hardware at Q4 quantization, I’m just holding my head above water at around 21 tokens per second. For a task running overnight or in the background, that’s totally fine. For an interactive task, it’s painful for me. I want to Get Things Done (TM), so I don’t like twiddling my thumbs watching the AI think.
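
To put the speed gap in wall-clock terms, here is a rough back-of-the-envelope sketch. The two speeds are the ones I quoted (50 and roughly 21 tokens per second); the 20,000-token output size is a made-up assumption for illustration, and the math ignores prompt processing and tool-call latency entirely.

```python
# Generation speeds quoted in this article (tokens per second on my hardware).
SPEEDS = {
    "Qwen 3.6 35B-A3B": 50.0,
    "Qwen 3.5 122B-A10B": 21.0,
}

# ASSUMPTION: a hypothetical amount of generated output for one coding session.
HYPOTHETICAL_OUTPUT_TOKENS = 20_000


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent purely generating tokens, ignoring prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    for model, speed in SPEEDS.items():
        minutes = generation_seconds(HYPOTHETICAL_OUTPUT_TOKENS, speed) / 60
        print(f"{model:<20} {minutes:5.1f} min")
```

Whatever token count you plug in, the ratio stays the same: the bigger model takes about 2.4 times as long to generate the same output.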

I was asked in the comments of my local AI stack article about this bigger Qwen 3.5 model, and I downplayed the gap with Qwen 3.6. Again, I was wrong. The bigger model is truly a couple of steps better than the medium model. Take another victory lap in the comment-o-sphere.

The difference between Kimi K2.6 and Claude Sonnet 4.6 wasn’t huge. Claude just aced the entire project as a one-shot, whereas Kimi took, I think, one or two manual prompts to tweak its result. Kimi K2.6 is great and cheap. Claude Sonnet 4.6 is just a bit better, and I didn’t bother adding Opus into the mix for this comparison since Sonnet already maxed out my point system.


Task #2: Creative Writing

It should come as a shock to exactly zero humans on planet Earth to find out I am a nerd. And a geek. And as part of my nerdy-geekery, I run a Star Wars RPG game as the game master. After 5+ years we have one character still alive from the first game session, but that’s only because death is merely suggestive in the Star Wars universe and there were a few timely natural 20s in there – the bane of malicious GMs everywhere.

After each game session, I turn my bullet point notes into a somewhat humorous narrative retelling. So I have words upon words upon words to use as training material for a creative writing agent. But you don’t get to see my Cody RPG Writeup skill! I don’t want to create an army of Cody sound-alikes. Only my players get the misfortune of reading my bad-joke-infused prose.

And instead of a real game session, I decided to make an RPG-style description of a famous historical battle. It is extremely likely you will recognize said battle if you read the fake game session notes.

I tried this task inside of Hermes, but it kept wanting to use its memories and search for extra details. That’s useful for other tasks, but for creative writing I need my harness to stay within the context given. So Open WebUI is my harness of choice here.

Prompt:
Inside of Open WebUI…

$cody-rpg-writeup [Game Session Notes](https://github.com/codysandahl/compare-local-llms/blob/main/compare-writing/game-session-notes.md)

Metrics:

  • Total word count
  • Does the write-up cover all points from the notes without inventing extra points?
  • Penalties for the LLM including elements from the original battle not included in the notes (shouldn’t stray from the task)
  • Bonus for every time I laugh
  • Penalties for re-using jokes or being boring

Results:

| Name | Words | Objective | Laughs | Bored | Total Score |
| --- | --- | --- | --- | --- | --- |
| Qwen 3.6 35B-A3B | 887 | 3 (I had to remove direct quotes from the original source material, and the fascist thing was a total sidebar) | 2 | -1 (just felt a bit draggy) | 4 |
| ChatGPT 5.4 | 1478 | 5 | 5 | -1 (a bit long) | 9 |
| Qwen 3.5 122B-A10B | 952 | 3 (I had to remove direct quotes from the original source material) | 6 | 0 | 9 |
| Claude Sonnet 4.6 | 1741 | 4 (made obvious references to the original source material, not even trying to play along with the concept that this was something inspired by but different) | 8 | -2 (definitely long) | 10 |
| Kimi K2.6 | 1273 | 4 | 10 | -1 (kinda boring prose interspersed with decent jokes) | 13 |
| Gemma 4 26B-A4B | 726 | 5 | 12 | 0 | 17 |
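
Word count is one of my metrics but not part of the total, so here is one hedged way to fold it in: laughs per thousand words. This is not how I scored the task, just a sketch of a derived number you could compute from the table; the function names are invented for illustration.

```python
# (name, word count, laughs) copied from the creative writing results table.
WRITEUPS = [
    ("Qwen 3.6 35B-A3B", 887, 2),
    ("ChatGPT 5.4", 1478, 5),
    ("Qwen 3.5 122B-A10B", 952, 6),
    ("Claude Sonnet 4.6", 1741, 8),
    ("Kimi K2.6", 1273, 10),
    ("Gemma 4 26B-A4B", 726, 12),
]


def laughs_per_thousand_words(words: int, laughs: int) -> float:
    """Normalize the laugh count by write-up length."""
    if words <= 0:
        raise ValueError("words must be positive")
    return 1000 * laughs / words


def by_laugh_density(writeups: list[tuple[str, int, int]]) -> list[tuple[str, int, int]]:
    """Sort write-ups from most to fewest laughs per thousand words."""
    return sorted(
        writeups,
        key=lambda w: laughs_per_thousand_words(w[1], w[2]),
        reverse=True,
    )


if __name__ == "__main__":
    for name, words, laughs in by_laugh_density(WRITEUPS):
        density = laughs_per_thousand_words(words, laughs)
        print(f"{name:<20} {density:5.1f} laughs per 1,000 words")
```

By this measure Gemma 4 still comes out on top, since it had both the most laughs and the shortest write-up.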

Color Commentary:
I have to say, I think some of the models were cheating cheaters who cheated. When I ran the same comparison on my real game session notes (with no connection to any scenario the models could have seen in training), the gap in writing quality between the models was FAR more pronounced: same order as above, just with wider gaps. Several of the models recognized the scenario I was referencing in my fake game session notes and tried to bring in extra details. I indignantly docked them points as a result.

The oddest result was Kimi K2.6 with its boring commentary interspersed with a number of decent jokes (at least according to my reference sense of humor…aka myself). I don’t really know what to make of this, but we can safely say that Kimi K2.6 has a sense of humor while the Qwen family of models has the soul and whimsy of a second-hand dishwasher.

I’ve already highlighted some surprises, but Gemma 4 26B-A4B shocked me with an outright crushing victory in the creative writing task. After I tried to bury Gemma 4 on the coding task, it did a full “I’m not dead yet” on me and came roaring back for the W on this task. I have tried it on different prompts and different tasks, and it remains persistently good at emulating my writing style. Either Google secretly stole my manuscripts for training, or maybe you should try getting Gemma 4 to do some creative writing or editing for you, too!


Task #3: Research

I research. You research. We all research! But which LLM researches the best-est? Let’s finish this cage match to find out.

I tried the research task inside Open WebUI (standard tool-use) and Hermes Agent (agentic loop). Hermes gave a clear leg up, allowing the local LLMs to close the gap with the Cloud LLMs a bit.

Prompt:
Research popular AI skill git repos. Find the top 5 skills (or prompt libraries) for management or leadership use cases. Before you research, first consider what the top use cases for those roles might be and give me a list of the ten most impactful in your estimation. Then use your top use cases to search for skill or prompt library repos.

Metrics:

  • Did it output use cases before searching?
  • Did it find Git repos?
  • Quality of the use cases
  • Quality of the recommended Git repos
  • Total time
  • Number of use cases specifically covered by the recommended repos

Results:

| Name | Order | Git repos | Use cases | Repo quality | Time | Coverage | Total | Notes (created by Cursor) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma 4 26B-A4B (Open WebUI) | 2 | 1 | 4 | 1 | 2 | 4 | 14 | Use cases first, but two of five “resources” are generic skills, not repos; thin GitHub specificity |
| Gemma 4 26B-A4B (Hermes) | 2 | 2 | 4 | 2 | 3 | 5 | 18 | Fifth “repo” is CrewAI/AutoGen, not GitHub; weak use-case-to-repo mapping |
| Qwen 3.5 122B-A10B | 2 | 3 | 4 | 2 | 0 | 7 | 18 | Slowest (25m 20s); five GitHub URLs but several are leadership reading lists or courses, not prompt/skill repos |
| ChatGPT 5.4 | 2 | 3 | 5 | 4 | 2 | 8 | 24 | Deep use-case write-ups; five on-target repos; caveats on stars/maintenance not verified |
| Qwen 3.6 35B-A3B | 2 | 3 | 4 | 4 | 4 | 8 | 25 | Fastest timed run (6m 34s); solid prompt libraries; gtm-skills is a stretch for general management |
| Claude Sonnet 4.6 | 2 | 3 | 5 | 4 | 2 | 9 | 25 | Strong fabric + role-based prompts; per-repo use-case mapping table; Prompt Engineering Guide is meta, not leadership-specific |
| Kimi K2.6 | 2 | 3 | 5 | 5 | 2 | 9 | 26 | Management-focused skill repos (PM skills, claude-skills, Copilot); honest gap callout on coaching/conflict |
Color Commentary:

Look at Qwen 3.6 35B-A3B creeping up there with the big boys! And while its bigger Qwen 3.5 cousin took the slight lead in coding, this scrappy medium-sized model crushed it with agentic tool use. Running inside Hermes significantly reduces the intrinsic knowledge advantage of bigger models, and that brings Qwen 3.6 even with Sonnet on this task. If you need a local model for research, look no further! Wait, why are you still looking? I said no further.

If you really want a Cloud model to do your research like a true deep-pocketed aristocrat, Claude Sonnet 4.6 and Kimi K2.6 were both really good. But Kimi impressed me with its quick results, some honest commentary about the limitations of what it could find, and top-notch answers.


Parting Thoughts

I have to eat some humble pie on a few of my assumptions, but I guess that’s why it’s good to test things on your real world use cases. Also, it’s fortunate that humble pie tastes like apple pie. Now I need some humble ice cream to go with it.

I think these results reinforce my feeling that local models are actually useful if you have the hardware to run medium-sized LLMs at decent speed.

How is my workflow changing based on these results?

  • Qwen 3.6 35B-A3B is taking over all my research tasks. Look no further.
  • Gemma 4 26B-A4B is now my trusty writing editor, but it will not even contemplate working on my code bases lest I banish it from my machine entirely.
  • I’m shifting more coding work to the Qwen family of models inside Claude Code instead of using Cursor as my default coding harness.