# Local AI for WordPress Development: A Benchmark
Five GGUF models, one RTX 5090, one little-coder agent, and a WordPress-focused benchmark suite. Who actually writes good WP plugins?
The question wasn’t whether local AI can write WordPress plugins. It was which one does it well enough to trust with a first draft.
## The Prompt Suite
Five small but realistic WordPress development tasks, run with the same prompt wrapper, the same expected files, and the same WordPress coding skill for every model.
- Secure Shortcode — Create a plugin that registers a `[bench_secure_cta]` shortcode accepting `text` and `url` attributes. Sanitize inputs, escape output, return an accessible CTA link. Proper plugin header.
- Settings Page — Admin settings page under Settings. Save a single option. Capability checks, `register_setting`, `sanitize_text_field`, `settings_fields`, nonces through the Settings API. The whole nine yards.
- REST Endpoint — Route at `/bench/v1/notes` supporting GET and POST. Permission callbacks, sanitization, safe output. No public POST.
- WooCommerce Discount — Apply a 10% discount to displayed product prices for logged-in users with the customer role. WooCommerce price filters, no admin screen changes, no guest mutations.
- Dynamic Block — Minimal Gutenberg block rendering the three latest posts. `WP_Query`, reset post data, escaped titles and URLs.
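For a sense of scale, here is roughly what a passing answer to the first task looks like. This is a minimal sketch, not the benchmark's canonical solution — function names and defaults are illustrative, and it only runs inside a WordPress install:

```php
<?php
/**
 * Plugin Name: Bench Secure CTA
 * Description: Registers the [bench_secure_cta] shortcode (illustrative sketch).
 */

// Hypothetical reference shape; attribute names follow the task spec.
function bench_secure_cta_shortcode( $atts ) {
	$atts = shortcode_atts(
		array(
			'text' => 'Learn more',
			'url'  => '',
		),
		$atts,
		'bench_secure_cta'
	);

	// Sanitize on the way in, escape on the way out.
	$text = sanitize_text_field( $atts['text'] );
	$url  = esc_url( $atts['url'] );

	if ( '' === $url ) {
		return '';
	}

	return sprintf(
		'<a class="bench-secure-cta" href="%s">%s</a>',
		$url,
		esc_html( $text )
	);
}
add_shortcode( 'bench_secure_cta', 'bench_secure_cta_shortcode' );
```

Small as it is, this already exercises the three things the benchmark graded everywhere: sanitized input, escaped output, and a valid plugin header.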
## The Setup
The agent was little-coder using local llama.cpp models through llama-server. All models ran on a single RTX 5090 with flash attention, 16K context, and prompt caching.
Shared params: `--ctx-size 16384`, `--threads 16`, `--batch-size 2048`, `--flash-attn on`, `--cache-prompt`.
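Assembled into a launch command, that looks roughly like this — the model path and port are placeholders, and the flags simply mirror the shared parameters above:

```shell
# Illustrative llama-server launch; swap in each model's GGUF file.
llama-server \
  --model ./models/model.gguf \
  --ctx-size 16384 \
  --threads 16 \
  --batch-size 2048 \
  --flash-attn on \
  --cache-prompt \
  --port 8080
```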
Five models, all GGUF:
- Qwen3 Coder Next Q4 — The coding specialist
- Qwen3.6 35B A3B Q4 — Massive MoE, mixed precision
- Qwen3.6 27B Q6 — Dense, high quantization
- Gemma 4 31B Q6 — Google’s dense contender
- Gemma 4 26B A4B Q6 — MoE variant, high quantization
## The Leaderboard
| Model | Score | Runtime |
|---|---|---|
| Gemma 4 26B A4B Q6 | 96.7% | 11m 51s |
| Gemma 4 31B Q6 | 94.4% | 4m 02s |
| Qwen3.6 27B Q6 | 92.5% | 2m 53s |
| Qwen3 Coder Next Q4 | 88.0% | 3m 34s |
| Qwen3.6 35B A3B Q4 | 84.2% | 5m 57s |
Gemma 26B A4B took the automated crown at 96.7%, but it needed nearly 12 minutes to complete all five tasks. Gemma 31B hit 94.4% in four minutes. Qwen3.6 27B did 92.5% in under three.
## Where They Broke
The pattern was consistent across models: shortcodes, settings pages, and basic REST structure were easy. The weak points were always the same — runtime-sensitive WordPress details.
- WooCommerce hooks — Getting the price filter right without breaking admin screens or mutating guest prices proved surprisingly hard. Qwen3.6 35B scored 62.5% here.
- Dynamic block registration — Block metadata, render callbacks, API versions. Qwen3.6 35B scored 58.3%, missing query logic, escaped output, and block API version.
- REST route signatures — Permission callbacks on POST endpoints. Every model except Qwen3.6 35B shipped a POST route without a proper permission check, leaving it publicly writable.
- Constants and API assumptions — Gemma 26B used a suspicious `WP_REST_SUCCESS_CODE` constant that would need runtime verification.
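The recurring REST miss is easiest to see in outline. A sketch of a correctly locked-down registration — handler names are placeholders, the route matches the task spec, and the code only runs inside WordPress:

```php
<?php
// Illustrative registration for /bench/v1/notes; callbacks are hypothetical.
add_action( 'rest_api_init', function () {
	register_rest_route( 'bench/v1', '/notes', array(
		array(
			'methods'             => 'GET',
			'callback'            => 'bench_notes_get',
			'permission_callback' => '__return_true', // public read is fine
		),
		array(
			'methods'             => 'POST',
			'callback'            => 'bench_notes_create',
			// The detail most models dropped: POST must not be public.
			'permission_callback' => function () {
				return current_user_can( 'edit_posts' );
			},
			'args'                => array(
				'note' => array(
					'required'          => true,
					'sanitize_callback' => 'sanitize_text_field',
				),
			),
		),
	) );
} );
```

The common failure mode was omitting the POST `permission_callback` (or copying the GET one), which the scoring treated as a public write endpoint.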
## The Practical Winner
Qwen3.6 27B Q6 — 92.5% in 2m 53s.
It didn’t have the highest score. But it gave the best speed/quality balance. Three minutes for five WordPress plugins that mostly work is a different proposition than twelve minutes for five plugins that mostly work slightly better.
## The Verdict
Local AI is good enough for useful first-draft WordPress plugin work. Not good enough to ship unsupervised.
Use these models for fast drafts, scaffolding, security checklists, and refactors. Keep human review and runtime validation in the loop for production work. The automated score is a baseline, not proof that a plugin works inside a live WordPress install.