# Local AI for WordPress Development: A Benchmark
Five GGUF models, one RTX 5090, one little-coder agent, and a WordPress-focused benchmark suite. Who actually writes good WP plugins?
The question wasn’t whether local AI can write WordPress plugins. It was which one does it well enough to trust with a first draft.
## The Prompt Suite
Five small but realistic WordPress development tasks, run with the same prompt wrapper, the same expected files, and the same WordPress coding skill for every model.
- Secure Shortcode — Create a plugin that registers a `[bench_secure_cta]` shortcode accepting `text` and `url` attributes. Sanitize inputs, escape output, return an accessible CTA link. Proper plugin header.
- Settings Page — Admin settings page under Settings. Save a single option. Capability checks, `register_setting`, `sanitize_text_field`, `settings_fields`, nonces through the Settings API. The whole nine yards.
- REST Endpoint — Route at `/bench/v1/notes` supporting GET and POST. Permission callbacks, sanitization, safe output. No public POST.
- WooCommerce Discount — Apply a 10% discount to displayed product prices for logged-in users with the customer role. WooCommerce price filters, no admin screen changes, no guest mutations.
- Dynamic Block — Minimal Gutenberg block rendering the three latest posts. `WP_Query`, reset post data, escaped titles and URLs.
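For a sense of scale, here is roughly what a passing answer to the first task looks like. This is a minimal sketch, not the benchmark's canonical solution — function names and defaults are illustrative, and it only runs inside a WordPress install:

```php
<?php
/**
 * Plugin Name: Bench Secure CTA
 * Description: Registers the [bench_secure_cta] shortcode (illustrative sketch).
 */

// Hypothetical reference shape; attribute names follow the task spec.
function bench_secure_cta_shortcode( $atts ) {
	$atts = shortcode_atts(
		array(
			'text' => 'Learn more',
			'url'  => '',
		),
		$atts,
		'bench_secure_cta'
	);

	// Sanitize on the way in, escape on the way out.
	$text = sanitize_text_field( $atts['text'] );
	$url  = esc_url( $atts['url'] );

	if ( '' === $url ) {
		return '';
	}

	return sprintf(
		'<a class="bench-secure-cta" href="%s">%s</a>',
		$url,
		esc_html( $text )
	);
}
add_shortcode( 'bench_secure_cta', 'bench_secure_cta_shortcode' );
```

Small as it is, this already exercises the three things the benchmark graded everywhere: sanitized input, escaped output, and a valid plugin header.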
## The Setup
The agent was little-coder using local llama.cpp models through llama-server. All models ran on a single RTX 5090 with flash attention, 16K context, and prompt caching.
Shared params: `--ctx-size 16384`, `--threads 16`, `--batch-size 2048`, `--flash-attn on`, `--cache-prompt`.
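Assembled into a launch command, that looks roughly like this — the model path and port are placeholders, and the flags simply mirror the shared parameters above:

```shell
# Illustrative llama-server launch; swap in each model's GGUF file.
llama-server \
  --model ./models/model.gguf \
  --ctx-size 16384 \
  --threads 16 \
  --batch-size 2048 \
  --flash-attn on \
  --cache-prompt \
  --port 8080
```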
Five models, all GGUF:
- Qwen3 Coder Next Q4 — The coding specialist
- Qwen3.6 35B A3B Q4 — Massive MoE, mixed precision
- Qwen3.6 27B Q6 — Dense, high quantization
- Gemma 4 31B Q6 — Google’s dense contender
- Gemma 4 26B A4B Q6 — MoE variant, high quantization
## The Leaderboard
| Model | Score | Runtime |
|---|---|---|
| Gemma 4 26B A4B Q6 | 96.7% | 11m 51s |
| Gemma 4 31B Q6 | 94.4% | 4m 02s |
| Qwen3.6 27B Q6 | 92.5% | 2m 53s |
| Qwen3 Coder Next Q4 | 88.0% | 3m 34s |
| Qwen3.6 35B A3B Q4 | 84.2% | 5m 57s |
Gemma 26B A4B took the automated crown at 96.7%, but it needed nearly 12 minutes to complete all five tasks. Gemma 31B hit 94.4% in four minutes. Qwen3.6 27B did 92.5% in under three.
## Where They Broke
The pattern was consistent across models: shortcodes, settings pages, and basic REST structure were easy. The weak points were always the same — runtime-sensitive WordPress details.
- WooCommerce hooks — Getting the price filter right without breaking admin screens or mutating guest prices proved surprisingly hard. Qwen3.6 35B scored 62.5% here.
- Dynamic block registration — Block metadata, render callbacks, API versions. Qwen3.6 35B scored 58.3%, missing query logic, escaped output, and block API version.
- REST route signatures — Permission callbacks on POST endpoints. Every model except Qwen3.6 35B shipped a POST route without a proper permission check, leaving it publicly writable.
- Constants and API assumptions — Gemma 26B used a suspicious `WP_REST_SUCCESS_CODE` constant that would need runtime verification.
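The recurring REST miss is easiest to see in outline. A sketch of a correctly locked-down registration — handler names are placeholders, the route matches the task spec, and the code only runs inside WordPress:

```php
<?php
// Illustrative registration for /bench/v1/notes; callbacks are hypothetical.
add_action( 'rest_api_init', function () {
	register_rest_route( 'bench/v1', '/notes', array(
		array(
			'methods'             => 'GET',
			'callback'            => 'bench_notes_get',
			'permission_callback' => '__return_true', // public read is fine
		),
		array(
			'methods'             => 'POST',
			'callback'            => 'bench_notes_create',
			// The detail most models dropped: POST must not be public.
			'permission_callback' => function () {
				return current_user_can( 'edit_posts' );
			},
			'args'                => array(
				'note' => array(
					'required'          => true,
					'sanitize_callback' => 'sanitize_text_field',
				),
			),
		),
	) );
} );
```

The common failure mode was omitting the POST `permission_callback` (or copying the GET one), which the scoring treated as a public write endpoint.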
## The Practical Winner
Qwen3.6 27B Q6 — 92.5% in 2m 53s.
It didn’t have the highest score. But it gave the best speed/quality balance. Three minutes for five WordPress plugins that mostly work is a different proposition than twelve minutes for five plugins that mostly work slightly better.
## The Verdict
Local AI is good enough for useful first-draft WordPress plugin work. Not good enough to ship unsupervised.
Use these models for fast drafts, scaffolding, security checklists, and refactors. Keep human review and runtime validation in the loop for production work. The automated score is a baseline, not proof that a plugin works inside a live WordPress install.