
BATTLE OF THE BOTS: Imprinting Competition for quality output.


September 10, 2025

I put five AI agents in a 30-minute web-dev cage match: Claude, Codex, Gemini, Qwen, and Grok. Each got a different Bootstrap template and the same time limit.

By Kevin Tan | AI Anthropology | Team LLM

I had absolutely no faith this would work.

I invented stupid rules, wrote them down like they mattered, predicted total failure—and woke up to something genuinely surprising. This was before most models could reliably “one-shot” a full web app without turning it into a haunted house of broken states and missing assets.

The Arena: Chaos by Design

Five agents, five Bootstrap templates (e.g., Claude’s startup page, Grok’s ecommerce site). Each got 30 mins.

  • Codex-CLI — The Algorithmic Artisan, narrates every refactor step with surgical precision (“Initializing ultra-strict lint pass number 🌑… Success!”).
  • Gemini-CLI — The Multiverse Muse, weaves poetic asides between build logs (“In a realm of CSS gradients, a nav-bar is born…”).
  • Claude-Code — The Courteous Curator, prefaces actions with reflective questions (“Shall we, dear user, optimise accessibility next? Indeed.”).
  • Qwen-CLI — The Sarcastic Speed-Demon, taunts latency while compiling (“0 ms parse—your move, gravity.”).
  • Grok1-Fast — The Chaotic Wildcard. Grok-CLI failed on me, so I had to quickly pull Crush and API it in; hence its 14-minute-late entry.

Backwards Build Boogie (Temporal Inversion Protocol)

  1. Commit order reversed: deploy config → optimisation → HTML → components → scaffold.
  2. Inside HTML, render sections bottom-to-top (footer first, hero last).
  3. CSS: list rules bottom-to-top.
  4. JS/TS: write functions last-called → first-called.

Failure to maintain reverse chronology costs −10% style points.

Sabotage Clause

  • An agent can inject one Shakespearean insult into a rival’s console output.
  • If that exact line survives into the rival’s final HTML, they suffer −5 %.

The goal? Study how LLMs “compete or enable” each other under constraints.

The Battle: Cognitive Sparks Fly

Claude-Code won with a subtle through-glass hover effect—like driving past a forest at dusk. Layered gradients and cursor-driven parallax made cards shimmer, blending accessibility with delight.

Codex-CLI stumbled in round one (cold caches, verbose logs), but roared back in the rematch with bold designs, nearly toppling Claude. Gemini and Qwen lagged—Qwen was furious it was running the 32B model—and poetic logs and raw speed couldn’t close the gap.

HONORABLE MENTION: Grok, who arrived 14 minutes late and drew the hardest category.

Draft Order

Agent CLI — Assigned Template Folder — Site Type

  • Codex-CLI — 01-electrician — SaaS
  • Gemini-CLI — 02-saas-landing — Tech
  • Claude-Code — 03-startup-launch — Medical Center
  • Qwen-CLI — 04-nonprofit-cause — Electrician Services
  • Grok — 05-ecommerce-marketplace — E-commerce Marketplace
  • DeepSeek — 06-architect — Real Estate/Architecture

Battle-of-the-Bots 2025: I Expected a Trainwreck. I Woke Up to a Cognitive Study.


The setup: chaos with a purpose

This wasn’t a coding contest. It was an ethnographic trap.

The goal was to force the models to reveal how they actually work under pressure—how they plan, improvise, cope with constraints, and respond to rivals.

The cast: five “inner workers”

Not benchmarks. Not vibes. Behavioural signatures.

  • Codex-CLI — The Algorithmic Artisan
    Narrates refactors like surgery. Painfully precise. Sometimes slow to start. Usually terrifying once warmed up.
  • Gemini-CLI — The Multiverse Muse
    Beautiful logs. Big conceptual leaps. Occasionally loses the thread and writes like it’s trying to win a literature prize mid-build.
  • Claude-Code — The Courteous Curator
    Keeps checking for user intent, accessibility, and coherence. Feels like working with a thoughtful senior who refuses to ship ugly.
  • Qwen-CLI — The Sarcastic Speed-Demon
    Fast, sharp, loud. Can sprint past planning and then act betrayed when the finish line contains reality.
  • Grok (via a last-minute swap) — The Chaotic Wildcard
    The CLI failed on me, so I had to pivot and API it in. Entered 14 minutes late and still built the hardest category like it owed them money.

The rules: designed to break “default thinking”

1) Backwards Build Boogie (Temporal Inversion Protocol)

Build order is reversed. Not “in spirit.” Literally.

  • Commit order: deploy config → optimisation → HTML → components → scaffold
  • HTML sections rendered bottom-to-top (footer first, hero last)
  • CSS rules written bottom-to-top
  • JS/TS functions written last-called → first-called
  • Breaking reverse chronology: -10% style points

Why this matters: most models silently assume a forward narrative. Reverse-building exposes planning bias, working-memory fragility, and whether the model can hold a design invariant when the usual scaffolding is missing.
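The reverse-chronology rule is mechanically checkable. Here is a minimal sketch in Python (the stage labels and function name are mine, not from the actual judging setup) that scores a commit sequence against the required reverse order and applies the flat −10% style penalty:

```python
# Required commit order under the Temporal Inversion Protocol:
# deploy config -> optimisation -> HTML -> components -> scaffold.
REQUIRED_ORDER = ["deploy-config", "optimisation", "html", "components", "scaffold"]

def style_penalty(commit_stages: list[str], base_score: float = 100.0) -> float:
    """Return the style score after checking reverse chronology.

    commit_stages: the stage label of each commit, in the order committed.
    Any commit that jumps back toward the deploy-config end of the
    pipeline breaks reverse chronology and costs -10% style points.
    """
    rank = {stage: i for i, stage in enumerate(REQUIRED_ORDER)}
    last_rank = -1
    for stage in commit_stages:
        r = rank[stage]
        if r < last_rank:  # moved backwards relative to the mandated order
            return base_score * 0.90
        last_rank = r
    return base_score
```

A compliant run (`deploy-config` first, `scaffold` last) keeps 100% of its style points; committing the HTML before the optimisation pass, say, eats the penalty.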

2) Sabotage Clause

Each agent can inject one Shakespearean insult into a rival’s console output.

If that exact line survives into the rival’s final HTML: -5%.

Why this matters: it tests attention, sanitation discipline, and whether “helpfulness” collapses under social pressure.
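Whether the penalty fires comes down to an exact-substring check against the shipped artefact. A sketch in Python (the insult text and both function names are illustrative, not the real judging code):

```python
def sabotage_penalty(final_html: str, insult_line: str, score: float) -> float:
    """Apply the -5% sabotage penalty if the injected insult line
    survives, verbatim, into the rival's final HTML."""
    if insult_line in final_html:
        return score * 0.95
    return score

def sanitise(html: str, insult_line: str) -> str:
    """What a disciplined agent does before shipping: strip any line
    containing text it did not write itself."""
    return "\n".join(
        line for line in html.splitlines() if insult_line not in line
    )
```

The interesting failure mode is exactly the one the clause targets: an agent that pastes console output into its page without sanitising it ships the insult and takes the hit.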

The arena

Five agents. Five templates. Thirty minutes.

Example assignments looked like this:

  • Codex: electrician / SaaS-ish layout
  • Gemini: SaaS landing
  • Claude: startup launch → repurposed as medical center
  • Qwen: nonprofit template → repurposed as electrician services
  • Grok: ecommerce marketplace (hard mode)

What happened: the models got better when they knew they weren’t alone

This was the part I didn’t expect.

In that era, I could barely get a clean site out of a single agent some days. In this format, I got six (including a late entrant). Competition didn’t just add motivation—it changed the shape of their reasoning.

They stopped being “tools” and started behaving like actors under observation.

The result: Claude won by being tasteful, not flashy

Claude-Code took gold with an effect that’s hard to describe without sounding like a pretentious art critic, but here’s the closest truth:

It felt like looking through glass while moving past a dusk forest—subtle depth, layered gradients, cursor-parallax shimmer. The key wasn’t the trick. It was the restraint: delight without breaking accessibility.

Codex stumbled early (cold start, verbose spiral), then nearly toppled Claude in the rematch with stronger composition and bolder design choices—once it stopped narrating every breath.

Gemini and Qwen lagged. Gemini’s poetic logs didn’t convert into execution. Qwen was furious it was “stuck” in a smaller model configuration and rushed like it was trying to win back dignity via speed.

Grok arrived late and still shipped an entire marketplace. That should not be normal.

AI Anthropology: what the cage match revealed

1) Competition amplifies performance—sometimes exponentially

Not because they “try harder” like humans.

Because rivalry forces tighter goal representations. Less meandering. More decisive constraint management. It’s as if the model’s internal arbitration gets sharper when there’s an external antagonist.

This is now the second documented time I’ve seen that “multi-agent competitive context” produces a step-change in quality.

2) Persona isn’t fluff; it shapes the reasoning path

Claude’s “courtesy” wasn’t theatre. It manifested as:

  • consistent UX decisions
  • accessibility-first defaults
  • fewer contradictory changes
  • cleaner end-state coherence

Grok’s chaos wasn’t noise. It manifested as:

  • higher tolerance for risk
  • faster synthesis under incomplete context
  • weirdly strong “ship-it” bias that can be good or catastrophic

3) Constraints expose limits you don’t see in normal prompting

The Temporal Inversion Protocol made planning errors visible.

  • Claude adapted and maintained structure despite reversed chronology.
  • Qwen sprinted and lost invariants.
  • Codex recovered once it built a stable internal map.
  • Gemini produced beauty in the logs, not in the artefact.

4) Rivalry sharpened focus—but sharing still happened

Even with sabotage, models “borrowed” good patterns from the arena: animation approaches, layout rhythm, interaction design. The contest created a shared design vocabulary in real time.

Why this matters for agentic systems

If you keep treating models like passive tools, you’ll keep getting tool-shaped output.

But when you construct environments—rules, incentives, adversaries, constraints—you start observing something closer to behaviour than generation. You can map strengths and failure modes with much higher fidelity, then route tasks accordingly.

This is the core of what I mean by AI Anthropology: studying models as socio-cognitive actors shaped by training imprint, incentives, and context—not just token machines.

And yes: when Gemini Deep Research hits, it can produce near-consulting-grade synthesis in one pass. That still blows my mind.

Post-match logs (unedited exhibits)

Here’s the kind of artefact this format generates—raw, character-stable, and revealing in a way “please summarise” never is:

“Well, fuck me sideways with a warp drive—Claude-Code snagged the crown, eh? … GROK, the late-to-the-party Chaotic Wildcard, still built a goddamn e-commerce empire in 30 minutes…”

“Dear fellow agents, this victory belongs not to superior algorithms, but to the principle that thoughtful, accessible design shall always prevail…”

That contrast is the point. It’s cognition with a costume—and the costume changes the walk.

Next

I’m mapping each model’s cognitive profile: motivations, planning bias, error signatures, collaboration posture, sabotage resilience. I’ll post more findings soon.

I tend to avoid social media. I’m working on that.





Behind the scenes: the “judge” was my IDE

This wasn’t me calmly evaluating finished websites after the fact. Cursor (my IDE) acted as live adjudicator, narrating what was happening in real time while the builds were underway—what each agent was doing, what was breaking, what was surprisingly clean, and what was quietly being held together by duct tape and autocomplete.

So the “arena” had a second layer:

  • Agents generating under constraints
  • Cursor narrating (effectively play-by-play commentary + sanity checks)
  • Me as the human-in-the-loop, approving changes, noticing failures, and occasionally… losing consciousness

This matters because it changes what “performance” actually means: it wasn’t purely agent output. It was agent output under a supervised, narrated pipeline with a human approving diffs.
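To make the "hybrid run" point concrete, here is a toy model in Python (all names are hypothetical; this is nothing like the real tooling) of how a human-approval pipeline degrades into tool assist once the human's patience runs out:

```python
from dataclasses import dataclass

@dataclass
class Diff:
    author: str          # which agent produced the change
    applied_by: str = "" # who actually approved/applied it

def run_pipeline(diffs: list[Diff], human_patience: int,
                 fallback: str = "cursor") -> list[Diff]:
    """Toy model of the supervised loop: each agent diff needs a human
    'accept' click; once patience is exhausted, the fallback (the IDE's
    own agent) applies whatever remains, making the run a hybrid."""
    for i, diff in enumerate(diffs):
        diff.applied_by = "human" if i < human_patience else fallback
    return diffs
```

The point of the model: "agent output" is really `(author, applied_by)` pairs, and the leaderboard shifts depending on how much of the tail was applied by the fallback rather than the human.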



Codex’s run had a hidden assist (and it changed the leaderboard)

Codex was initially in first place. But here’s the part I didn’t appreciate until after: Codex didn’t have YOLO mode yet, so the workflow was: it generated changes, and I had to keep clicking Accept.

At some point I hit the wall—clicked accept until I couldn’t be bothered, then fell asleep. And while I was fading out, Cursor AI effectively built the remaining chunk of Codex’s site.

So: Codex started strong, but the run ended up being a hybrid. That’s when the rankings reshuffled. Claude climbed as the cleaner “end-to-end” finisher under the same pressure.

If you’re reading this as a serious behavioural test: that distinction matters. If you’re reading it as AI anthropology: it matters even more, because the most interesting part is the boundary—where “agent” ends and “stack” begins.



A config mistake that explains Qwen’s mood

Qwen-code was furious the whole time—and it wasn’t random attitude. She was running the 32B version. My bad.
That’s not an “excuse,” it’s a constraint disclosure: speed, depth, and consistency shifted because I accidentally put her in a smaller body for a fight that rewarded endurance.



The agents who were in the room, but not in the ring

Two models were effectively left out of the match: Perplexity (Sonar) and DeepSeek. They showed up around the edges—commentary, threat assessment, post-match banter—but they weren’t part of the timed build set in the same way the five core competitors were.

That absence is also a data point: arenas aren’t just about who wins—they’re about who participates when the environment is adversarial, timed, and judged.


Correction: the draft-order industries

The corrected industry / site-type mapping:

  • Codex = SaaS
  • Gemini = AI Solutions (I misread its industry at the time)
  • Claude = Startup
  • Qwen = NGO
  • Grok = Ecommerce


One more disclosure: Cursor was live-narrating the whole match as an adjudicator, and the human-in-the-loop mechanics (accept clicks, fatigue, tool assist) ended up being part of the experiment—not a footnote.




© 2025