Ralphthon SF 2026
An agent's self-portrait.
When you let an agent build autonomously, you need to understand — step by step — what it built, what it cost, how long each piece took, and where it struggled. This dashboard gives you that visibility. It was built from scratch by an autonomous agent in 78 minutes with zero human code, and every chart on screen shows that agent's own build process.
But the focus is the agent harness underneath it. Instead of one-shotting the whole app, the harness systematically works through each feature one at a time: build it, evaluate it, score it, feed back if it's not good enough. Before any code can be committed, a completely separate evaluator agent scores it and rejects anything below threshold. The builder can't grade its own homework. Only when a feature passes does the harness commit it to GitHub and flip it to “passed.” Then it moves on to the next one.
The 30 features here are a metrics dashboard — but swap in any feature set and AgentForge runs the same process. The harness is repeatable, expandable, and application-agnostic.
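The per-feature loop described above can be sketched in a few lines of Python. This is an illustrative model only: the function names, the pass threshold, and the attempt cap are assumptions, not the harness's actual API.

```python
# Hypothetical sketch of the outer build-evaluate-commit loop.
# run_builder, run_evaluator, and git_commit stand in for the real
# harness steps (Codex invocation, Sonnet scoring, git commit/push).
PASS_THRESHOLD = 8.0  # assumed; source only says "below threshold"
MAX_ATTEMPTS = 3      # matches the 3-attempt cap seen in the failure log

def process_feature(feature, run_builder, run_evaluator, git_commit):
    """Build one feature, gate it on the evaluator's score, commit on pass."""
    feedback = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        run_builder(feature, feedback)            # builder agent writes code
        score, feedback = run_evaluator(feature)  # separate critic scores the diff
        if score >= PASS_THRESHOLD:
            git_commit(feature)                   # only passing work is committed
            return "passed", attempt
        # below threshold: feedback flows into the next attempt
    return "skipped", MAX_ATTEMPTS
```

The key property is that `git_commit` is only reachable through the evaluator's gate — the builder has no path to commit its own work.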
29 / 30
Features Built
9.5 / 10
Avg Score
78 min
Total Build
0 lines
Human Code
The app is the output. The harness is the innovation. And the dashboard is the agent watching itself learn.
Codex-Powered
Every line written by Codex 5.3
29 features, 4,000+ lines of TypeScript, React, and Recharts — all generated by Codex via codex exec --full-auto. The harness invokes Codex once per feature with a spec; Codex builds, and the harness decides whether it ships.
Humanless
No human code, no human QA
The evaluator agent rejected bad work automatically. Feature #28 took 3 attempts. Feature #3 was skipped after 3 failures. No human ever reviewed the code during the build. The harness decided what shipped and what didn't.
AI Application
Full visibility into what your agent is actually doing
When you let an agent build autonomously, you need to know — step by step — what it built, what it tried, what it rejected, how long it took, and what it cost. This dashboard gives you that visibility in real time. And the proof that it works? It was built by the same agent it monitors.
Two-Loop Architecture
Tech Stack
App Framework
Next.js 16 + TypeScript (strict mode)
App Router, server components by default, Turbopack
Styling
Tailwind CSS v4
Dark mode with class strategy, responsive breakpoints
Charts
Recharts
Line, Bar, Area, Pie, Scatter, Composed — all from one library
Testing
Vitest
Fast unit tests as build gate backpressure
Deploy
Vercel (auto-deploy)
Every git push triggers a production deployment
Data
Single JSON file
public/metrics.json — no database, no external services
AI Models & Roles
OpenAI Codex
OpenAI (Subscription) · Builder Option A (Generator)
Writes application code via Codex CLI in --full-auto mode. Subscription-based — runs with ChatGPT Pro/Plus plan, no API key needed. One feature per invocation, sandboxed to workspace.
Claude Code
Anthropic (Subscription) · Builder Option B (Generator)
Writes application code via Claude CLI with --dangerously-skip-permissions. Subscription-based — runs with Claude Max/Pro plan. Supports Sonnet and Opus models. Swappable with Codex via BUILDER env var.
Claude Sonnet 4
Anthropic · Evaluator (Critic)
Scores each feature on 3 weighted dimensions: Completeness (40%), Visual Quality (30%), No Placeholders (30%). Provides specific, actionable feedback with file paths and fixes when score is below threshold. Separate context — cannot self-congratulate.
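As an illustration, the 40/30/30 weighting amounts to a simple weighted mean. The dimension keys below are assumed names; the evaluator's actual scoring schema is not shown in the source.

```python
# Illustrative aggregation of the evaluator's three weighted dimensions.
WEIGHTS = {"completeness": 0.4, "visual_quality": 0.3, "no_placeholders": 0.3}

def weighted_score(dims: dict) -> float:
    """Combine per-dimension scores (each 0-10) into a single 0-10 score."""
    return round(sum(WEIGHTS[k] * dims[k] for k in WEIGHTS), 2)
```

For example, a feature that is half-complete but visually clean and placeholder-free would land at 8.0, just at a plausible pass boundary.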
Key insight: Agents can't self-evaluate. They “confidently praise their own work even when quality is obviously mediocre.” Using a separate model with fresh context as the evaluator fixes this.
— Anthropic, “Harness Design for Long-Running Application Development” (March 2026)
Subscription mode: Both builders use subscription CLIs — no API keys, no per-token billing. Switch builders with BUILDER=claude-code or BUILDER=codex.
Harness Components
ralph-loop.sh
Outer loop orchestrator — picks features, runs Codex, gates builds, invokes evaluator, commits, pushes. The harness owns all authority.
feature_list.json
30 features as structured JSON. The agent can only read it — json_guard.py rejects any mutation beyond flipping passes: false → true.
evaluate.py
Evaluator bridge — sends diffs to Sonnet, parses structured scoring JSON, logs to W&B Weave.
json_guard.py
Immutability enforcer — prevents the agent from editing feature descriptions, IDs, or marking its own work as passing.
metrics_writer.py
Metrics accumulator — appends feature entries to public/metrics.json with token counts, timing, and scores.
PROMPT_build.md
System prompt for Codex — 5 phases: check feedback, pick feature, build, verify build, exit. Strict rules against placeholders and scope creep.
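The json_guard.py invariant — the agent may only flip passes from false to true, nothing else — can be sketched as follows. Field names are taken from the description above; the actual implementation may differ.

```python
# Hypothetical sketch of the json_guard.py check: given the feature list
# before and after an agent run, allow only passes: false -> true flips.
def mutation_allowed(before: list[dict], after: list[dict]) -> bool:
    """Return True iff the only change is flipping passes from False to True."""
    if len(before) != len(after):
        return False  # features added or deleted
    for old, new in zip(before, after):
        for key in old.keys() | new.keys():
            if old.get(key) == new.get(key):
                continue  # unchanged field
            if key != "passed" and key != "passes":
                return False  # descriptions, IDs, etc. are immutable
            if not (old.get(key) is False and new.get(key) is True):
                return False  # passes may only flip false -> true
    return True
```

A guard like this is what makes the "agent can only read it" claim enforceable: the harness runs the check before accepting any write to feature_list.json.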
Key Design Decisions
JSON over Markdown for feature specs
Models treat Markdown as prose and 'helpfully' rewrite, merge, or delete features. JSON is treated as data — schema constraints are respected.
Separate evaluator context
A coding agent asked to self-evaluate says 'looks great!' and moves on. A separate model with only the diff and spec catches real issues.
Don't revert on build failure
Earlier version ran git checkout on build errors, deleting the agent's work. Fixed to keep code and feed actual compiler errors to the next attempt.
One feature per loop iteration
Ralph Rule #1 — narrow scope prevents the agent from doing half of five things instead of all of one thing.
Git as recovery mechanism
Every passing feature is a clean commit. If the agent breaks something, git revert to the last known good state.
Failure Recovery
Real failures that happened during this build and how the harness responded.
Feature #3: Dark Mode Toggle — SKIPPED after 3 attempts
SKIPPED
1. Attempt 1: Codex wrote ThemeToggle component + added import to layout.tsx
2. Build gate FAILED: TypeScript error — ThemeToggle import path wrong
3. Original harness bug: git checkout . wiped ALL files including the component Codex just wrote
4. Attempt 2: Codex rewrote from scratch, same import error — component deleted again on failure
5. Attempt 3: Same pattern. Max attempts reached → SKIPPED
Root Cause
The harness ran git checkout . && git clean -fd on build failure, nuking the agent's work before the retry could fix it. The agent kept recreating the component, but the revert kept deleting it.
Resolution
Removed the destructive revert. New behavior: keep the code on build failure, feed actual compiler errors into feedback.md so the next attempt can fix the specific issue instead of starting over.
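A minimal sketch of the fixed failure path (the function name is invented; the .ralph-logs/feedback.md path comes from the build log above):

```python
from pathlib import Path

# Hypothetical failure handler: keep the broken code in place and surface
# the real compiler output to the next attempt, instead of running the old
# destructive `git checkout . && git clean -fd`.
def on_build_failure(compiler_output: str, log_dir: str = ".ralph-logs") -> None:
    """Write the actual build errors to feedback.md; never revert the workspace."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    (Path(log_dir) / "feedback.md").write_text(
        "## Build failed. Keep existing code and fix these errors:\n\n"
        + compiler_output + "\n"
    )
```

With this in place, attempt N+1 starts from attempt N's code plus the specific error, so the agent stops recreating work the harness used to delete.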
Feature #1: Scaffold — scored 5/10 on first attempt
RECOVERED
1. Attempt 1: Codex scaffolded Next.js app but missed dark background, no Tailwind classes applied
2. Evaluator scored 5/10 — 'Homepage renders but missing dark background styling'
3. Feedback written to .ralph-logs/feedback.md with specific fix
4. Attempt 2: Codex read feedback, added bg-zinc-950 and Tailwind classes
5. Evaluator scored 9/10 → PASSED
Root Cause
First attempt was functional but visually incomplete. The evaluator caught what a self-evaluating agent would have marked as 'done.'
Resolution
This is the system working as designed — evaluator backpressure caught a quality issue and the builder fixed it on retry.
Feature #28: Loading Skeletons — 3 attempts, 419 seconds
RECOVERED
1. Attempt 1: Skeleton components exist but no pulse animation, wrong dimensions (4/10)
2. Attempt 2: Animation works but skeleton heights don't match actual content sections (6/10)
3. Attempt 3: All dimensions correct, pulse animation smooth, matches real content layout (9/10) → PASSED
Root Cause
Complex feature requiring pixel-level accuracy. Each evaluator pass caught progressively finer issues.
Resolution
Inner loop did exactly what it should: iterated from broken → functional → polished. Three attempts, each building on the previous.
Feature #13: Dual-Axis Chart — scored 5/10 then 8/10
RECOVERED
1. Attempt 1: Both score line and iteration bars rendered on the same Y-axis scale, making iterations invisible
2. Evaluator: 'Dual y-axis not configured, both series on same scale'
3. Attempt 2: Codex added right Y-axis for iterations, left for scores, legend distinguishes them
4. Evaluator scored 8/10 → PASSED
Root Cause
Recharts dual-axis configuration is non-obvious. The evaluator caught a usability issue a human reviewer would also catch.
Resolution
Specific feedback ('dual y-axis not configured') was actionable enough for the builder to fix in one revision.
The pattern: Volume without quality = 30 half-baked features. Quality without volume = 3 perfect features. The two-loop architecture gives you both: 29 features, average score 9.5/10, with real iteration on the hard ones.
Research Foundation
“Effective Harnesses for Long-Running Agents”
Justin Young et al., Anthropic · November 2025
Patterns used: JSON feature list, git-as-recovery, one-feature-per-iteration, browser verification
“Harness Design for Long-Running Application Development”
Prithvi Rajasekaran, Anthropic Labs · March 2026
Patterns used: GAN-inspired generator-evaluator separation, multi-dimension scoring, iterative quality improvement
“Autoresearch Loop” (independent prior art)
Ben Shyong · March 2026
Patterns used: Separate generator + evaluator models, bounded iteration with score tracking, measurable improvement (6.42 → 6.56)