LLM Picker for Software Engineering

Discussion / Brainstorming

Qwen 3.7 Max (Best)

Text #15 · Intelligence 46 · $0.25/M

Highest overall reasoning, with strong instruction following.

MiMo V2.5 Pro (Balance)

Expert #11 · Long Query #14 · $0.18/M

Best value for deep technical discussion and long-form ideation.

MiMo V2.5 (Budget)

Text #71 · $0.06/M

Same MiMo family at one-third the price of MiMo V2.5 Pro ($0.06/M vs. $0.18/M). Good enough for casual discussion.

Planning / Architecture

GLM-5.2 (Best)

Agent Confirmed Success #2 · FrontierSWE 74.4% · $0.90/M

Long-horizon specialist. Best at turning vague requirements into executable plans.

Qwen 3.7 Max (Balance)

Text #15 · WebDev #7 · $0.25/M

Strong general reasoning at mid price. Good for medium-complexity design.

DeepSeek V4 Pro (Budget)

Text #38 · $0.18/M

Cheap per token, but verbose: it spends many tokens per task. Acceptable for planning if the added latency is tolerable.

Executing (Faithful Coding)

MiMo V2.5 Pro (Best)

Instruction Following #15 · Coding #17 · $0.18/M

Follows the spec most closely and is the least prone to skipping work. Strong coding without overthinking.

Kimi K2.7 Code (Balance)

WebDev #22 · Agent Arena #14 · $0.70/M

Code-specialized. Good when the executor needs to handle complex code patterns.

DeepSeek V4 Flash (Budget)

Agent Confirmed Success 4.27% · $0.06/M

Community reports: "finishes real coding tasks faster than Sonnet, at Sonnet quality."

WebDev / Frontend

GLM-5.2 (Best)

WebDev #2 · $0.90/M

Ranks #2 on the WebDev leaderboard, the highest of any model in this guide.

Qwen 3.7 Max (Balance)

WebDev #7 · $0.25/M

Strong frontend work at a reasonable price.

MiniMax M3 (Budget)

WebDev #16 · Open weights · $0.22/M

Best open-weights WebDev pick. Self-hostable.

Agentic / Long-horizon

GLM-5.2 (Best)

Agent Arena #7 · Confirmed Success 9.13% · $0.90/M

Highest task completion among open models for multi-step agentic work.

Kimi K2.6 (Balance)

Agent Arena #18 · Steerability 4.14% · $0.70/M

Balanced agent. Good correction handling and tool reliability.

DeepSeek V4 Flash (Budget)

Agent Arena #17 · $0.06/M

Surprisingly strong tool use for the price. Fast.

Quick Fixes

Kimi K2.6 (Best)

Text #34 · Coding #13 · $0.70/M

Fast, reliable, good for small coding tasks.

DeepSeek V4 Flash (Balance)

$0.06/M · 103 tok/s

Best speed/price ratio for rapid iterations.

MiMo V2.5 (Budget)

$0.06/M · 89 tok/s

Cheapest acceptable fallback for trivial edits.
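
To make the speed and price figures on the Quick Fixes cards concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the listed $/M price applies to output tokens and that the tok/s figures are sustained generation speeds; the 2,000-token patch size is purely illustrative. Kimi K2.6 is left out because no tok/s figure is listed for it.

```python
# Figures copied from the Quick Fixes cards; the 2,000-token patch size
# is an illustrative assumption, not a measurement.
MODELS = {
    "DeepSeek V4 Flash": {"usd_per_m": 0.06, "tok_per_s": 103},
    "MiMo V2.5": {"usd_per_m": 0.06, "tok_per_s": 89},
}


def estimate(model: str, output_tokens: int) -> tuple[float, float]:
    """Return (seconds, dollars) to generate `output_tokens` tokens."""
    spec = MODELS[model]
    seconds = output_tokens / spec["tok_per_s"]
    dollars = output_tokens / 1_000_000 * spec["usd_per_m"]
    return seconds, dollars


for name in MODELS:
    seconds, dollars = estimate(name, 2_000)
    print(f"{name}: {seconds:.1f} s, ${dollars:.5f}")
```

At the same list price, the difference between the two comes down to generation speed, which is why V4 Flash sits above MiMo V2.5 for rapid iteration.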

Pair a planner with an executor: the planner designs, and the executor implements the plan in full without cutting corners.

Tier	Planner	Executor	Use Case	Cost
Premium	GLM-5.2	MiMo V2.5 Pro	Long-horizon refactor, ambiguous requirements	$$$
Balanced	Qwen 3.7 Max	Kimi K2.7 Code	Daily feature work, production code	$$
Budget	DeepSeek V4 Pro	DeepSeek V4 Flash	Prototyping, small tasks, high volume	$
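
The pairings in the table can be expressed as a small lookup plus a two-stage call. This is a minimal sketch, not a reference implementation: `call_model`, the prompt wording, and `fake_client` are hypothetical stand-ins for whatever API client you actually use.

```python
from typing import Callable, NamedTuple


class Combo(NamedTuple):
    planner: str
    executor: str


# Pairings copied from the table above.
COMBOS = {
    "premium": Combo("GLM-5.2", "MiMo V2.5 Pro"),
    "balanced": Combo("Qwen 3.7 Max", "Kimi K2.7 Code"),
    "budget": Combo("DeepSeek V4 Pro", "DeepSeek V4 Flash"),
}

# Hypothetical client signature: (model_name, prompt) -> completion text.
CallModel = Callable[[str, str], str]


def plan_then_execute(tier: str, task: str, call_model: CallModel) -> str:
    """Ask the tier's planner for a plan, then hand that plan to the executor."""
    combo = COMBOS[tier]
    plan = call_model(combo.planner, f"Write a step-by-step plan for: {task}")
    return call_model(
        combo.executor,
        f"Implement this plan exactly, skipping nothing:\n{plan}",
    )


# Stand-in client so the sketch runs without any API access.
def fake_client(model: str, prompt: str) -> str:
    return f"[{model}] {prompt.splitlines()[0]}"


print(plan_then_execute("budget", "add a --dry-run flag", fake_client))
```

Passing the client in as a parameter keeps the tier table independent of any particular provider, so swapping a planner or executor is a one-line change.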

Premium Combo Rationale

GLM-5.2 creates the most durable plans for messy, long-running tasks. MiMo V2.5 Pro executes them faithfully without overthinking or skipping steps.

Balanced Combo Rationale

Qwen 3.7 Max offers the best reasoning-per-dollar for planning. Kimi K2.7 Code is a code specialist that respects the plan.

Budget Combo Rationale

DeepSeek V4 Pro plans cheaply, though verbosely. V4 Flash executes quickly and cheaply. The low price is the reason to accept the token inefficiency.

What "faithful execution" means

A good executor implements the exact plan: no skipped files, no placeholder functions, no unilateral simplifications. It is measured mainly by the Instruction Following and Coding scores on LMArena.
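
The definition above can also be spot-checked locally. The sketch below is an illustrative heuristic, not how LMArena scores anything: it assumes you have the planned file list and the executor's output as a path-to-contents mapping, and the placeholder markers are examples you would extend for your own codebase. It cannot detect unilateral simplifications, only skipped files and obvious stubs.

```python
import re

# Illustrative placeholder markers; extend to match your codebase.
PLACEHOLDER = re.compile(
    r"\bTODO\b|\bFIXME\b|NotImplementedError|rest of (the )?(code|implementation)",
    re.IGNORECASE,
)


def audit(planned_files: set[str], produced: dict[str, str]) -> dict[str, list[str]]:
    """Compare executor output (path -> file contents) against the plan."""
    produced_paths = set(produced)
    return {
        "skipped": sorted(planned_files - produced_paths),
        "unplanned": sorted(produced_paths - planned_files),
        "placeholders": sorted(
            path for path, text in produced.items() if PLACEHOLDER.search(text)
        ),
    }


report = audit(
    {"api.py", "models.py", "tests/test_api.py"},
    {
        "api.py": "def handler(req):\n    return {'ok': True}\n",
        "models.py": "def save(obj):\n    raise NotImplementedError\n",
    },
)
print(report)
```

In this example the report lists `tests/test_api.py` as skipped and `models.py` as containing a placeholder.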

Data sources

LMArena (Text, WebDev, Agent Arena), Artificial Analysis Intelligence Index, GLM-5.2 release benchmark table, r/LocalLLaMA community reports. Data window: June–July 2026.

MiniMax M3

Users report thinking loops and "horrible code" in OpenCode. Sparse attention support is still maturing in llama.cpp. Strong WebDev leaderboard rank (#16), but unreliable for complex agentic execution.

DeepSeek V4 Pro

1.6T parameters, but "too big for such midrange performance" according to community reports. Low intelligence density: it needs roughly 10x more tokens than GPT-5.5 for similar output. Use it for budget planning only.
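
Token inefficiency changes what the list price means in practice. As a rough worked example, assuming the ~10x community estimate holds and billing is linear in tokens, the effective price per million baseline-equivalent tokens is the list price times the multiplier. The function name is hypothetical, and no multiplier is given for the other models in this guide, so this is not a cross-model comparison.

```python
def effective_usd_per_m(list_price: float, token_multiplier: float) -> float:
    """Price per million baseline-equivalent tokens.

    token_multiplier is how many tokens this model spends for each token
    a baseline model would spend on the same output.
    """
    return list_price * token_multiplier


# $0.18/M is the DeepSeek V4 Pro price listed in this guide; 10x is the
# community estimate quoted above, so treat the result as a rough figure.
print(f"${effective_usd_per_m(0.18, 10):.2f}/M")  # prints $1.80/M
```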

MiMo V2.5 Pro

Occasional loops reported. Not yet in Agent Arena top 28, so long-horizon tool use is less proven than GLM-5.2 or Kimi.

GLM-5.2

Premium price ($0.90/M). Can over-plan for simple tasks. Best reserved for long-horizon or ambiguous work.

Kimi K2.7 Code

Newer variant with less community review than K2.6. Code-specialized, but weaker at general discussion than Qwen 3.7 Max.

MiniMax M2.7

Intelligence 38 (lowest in this list), yet Agent Arena users rank it #1 in confirmed success and praise. An odd outlier: try it if tool reliability matters more than raw reasoning.