Self-referential metrics dashboard

Ralphthon SF 2026

An agent's self-portrait.

When you let an agent build autonomously, you need to understand — step by step — what it built, what it cost, how long each piece took, and where it struggled. This dashboard gives you that visibility. It was built from scratch by an autonomous agent in 78 minutes with zero lines of human code, and every chart on screen shows that agent's own build process.

But the focus is the agent harness underneath it. Instead of one-shotting the whole app, the harness systematically works through each feature one at a time: build it, evaluate it, score it, feed the result back if it's not good enough. Before any code is committed, a completely separate evaluator agent scores it and rejects anything below threshold. The builder can't grade its own homework. Only when a feature passes does the harness commit to GitHub and flip it to “passed.” Then it moves on to the next one.

The 30 features here are a metrics dashboard — but swap in any feature set and AgentForge runs the same process. The harness is repeatable, expandable, and application-agnostic.

29 / 30

Features Built

9.5 / 10

Avg Score

78 min

Total Build

0 lines

Human Code

The app is the output. The harness is the innovation. And the dashboard is the agent watching itself learn.

Codex-Powered

Every line written by Codex 5.3

29 features, 4,000+ lines of TypeScript, React, and Recharts — all generated by Codex via codex exec --full-auto. The harness calls it once per feature with a spec. Codex builds; the harness decides if it ships.

Humanless

No human code, no human QA

The evaluator agent rejected bad work automatically. Feature #28 took 3 attempts. Feature #3 was skipped after 3 failures. No human ever reviewed the code during the build. The harness decided what shipped and what didn't.

AI Application

Full visibility into what your agent is actually doing

When you let an agent build autonomously, you need to know — step by step — what it built, what it tried, what it rejected, how long it took, and what it cost. This dashboard gives you that visibility in real time. And the proof that it works? It was built by the same agent it monitors.

Two-Loop Architecture

RALPH LOOP (outer — iterates over 30 features)
│
├── Read state (feature_list.json + git log + progress)
├── Pick next feature (first "passes": false in JSON)
│
├── EVALUATOR LOOP (inner — up to 3 attempts per feature)
│   │
│   ├── Attempt 1: Codex builds the feature
│   ├── Gate: npm run build (must compile)
│   ├── Evaluator (Sonnet) scores on 3 dimensions
│   ├── Score < threshold? → feedback → retry
│   │
│   ├── Attempt 2: Codex revises based on feedback
│   ├── Gate + Evaluator re-scores
│   ├── Score >= threshold? → PASS
│   │
│   └── Attempt 3 (max): pass, accept-if-close, or skip
│
├── PASS → json_guard marks feature, git commit, write metrics
├── SKIP → revert, mark skipped, move on
│
└── [loop restarts → next feature]
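The control flow above can be sketched in Python. This is a minimal sketch, not the real harness (which is ralph-loop.sh plus Python helpers); the threshold value and callback names are assumptions:

```python
MAX_ATTEMPTS = 3
THRESHOLD = 7.0  # assumed pass threshold; the real value isn't stated

def run_harness(features, build, gate, evaluate, commit, skip):
    """Two-loop sketch: outer loop over features, inner evaluator loop."""
    for feature in features:
        if feature.get("passes"):
            continue                         # already shipped, move on
        feedback = None
        for attempt in range(1, MAX_ATTEMPTS + 1):
            build(feature, feedback)         # builder attempt; revises on feedback
            if not gate():                   # build gate: must compile
                feedback = "compiler errors" # keep the code, feed errors forward
                continue
            score, feedback = evaluate(feature)  # separate evaluator scores the diff
            if score >= THRESHOLD:
                commit(feature)              # guard flips passes, git commit + push
                feature["passes"] = True
                break
        else:
            skip(feature)                    # max attempts reached: revert, mark skipped
```

The key property is that only the harness calls commit and skip; the builder never decides its own fate.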

Tech Stack

App Framework

Next.js 16 + TypeScript (strict mode)

App Router, server components by default, Turbopack

Styling

Tailwind CSS v4

Dark mode with class strategy, responsive breakpoints

Charts

Recharts

Line, Bar, Area, Pie, Scatter, Composed — all from one library

Testing

Vitest

Fast unit tests as build gate backpressure

Deploy

Vercel (auto-deploy)

Every git push triggers a production deployment

Data

Single JSON file

public/metrics.json — no database, no external services

AI Models & Roles

OpenAI Codex

OpenAI (Subscription)

Builder Option A (Generator)

Writes application code via Codex CLI in --full-auto mode. Subscription-based — runs with ChatGPT Pro/Plus plan, no API key needed. One feature per invocation, sandboxed to workspace.

Claude Code

Anthropic (Subscription)

Builder Option B (Generator)

Writes application code via Claude CLI with --dangerously-skip-permissions. Subscription-based — runs with Claude Max/Pro plan. Supports Sonnet and Opus models. Swappable with Codex via BUILDER env var.

Claude Sonnet 4

Anthropic

Evaluator (Critic)

Scores each feature on 3 weighted dimensions: Completeness (40%), Visual Quality (30%), No Placeholders (30%). Provides specific, actionable feedback with file paths and fixes when score is below threshold. Separate context — cannot self-congratulate.
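The weighted score described above can be sketched as follows. The dimension names and weights come from the text; the linear aggregation is an assumption:

```python
WEIGHTS = {
    "completeness": 0.40,      # does the feature do what the spec asked?
    "visual_quality": 0.30,    # does it look finished?
    "no_placeholders": 0.30,   # no TODOs, stub data, or lorem ipsum
}

def weighted_score(dimension_scores):
    """Combine per-dimension 0-10 scores into one weighted 0-10 score."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
```

A feature that is complete but visually rough, say {10, 5, 0}, would score 5.5 and be sent back with feedback.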

Key insight: Agents can't self-evaluate. They “confidently praise their own work even when quality is obviously mediocre.” Using a separate model with fresh context as the evaluator fixes this. (Anthropic, “Harness Design for Long-Running Application Development”, March 2026)

Subscription mode: Both builders use subscription CLIs — no API keys, no per-token billing. Switch builders with BUILDER=claude-code or BUILDER=codex.
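The builder swap might reduce to something like this sketch. The CLI names and flags come from the text above; how the prompt is passed to each CLI is an assumption:

```python
import os

def builder_command(prompt_path):
    """Select the builder CLI from the BUILDER env var (default: codex)."""
    builder = os.environ.get("BUILDER", "codex")
    if builder == "codex":
        # subscription-based Codex CLI, sandboxed full-auto mode
        return ["codex", "exec", "--full-auto", prompt_path]
    if builder == "claude-code":
        # subscription-based Claude CLI; permissions prompt disabled
        return ["claude", "--dangerously-skip-permissions", prompt_path]
    raise ValueError(f"unknown BUILDER: {builder}")
```

Keeping the builder behind one function is what makes the two options swappable without touching the rest of the loop.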

Harness Components

ralph-loop.sh

Outer loop orchestrator — picks features, runs Codex, gates builds, invokes evaluator, commits, pushes. The harness owns all authority.

feature_list.json

30 features as structured JSON. Agent can only read it — json_guard.py rejects any mutation beyond flipping passes: false → true.

evaluate.py

Evaluator bridge — sends diffs to Sonnet, parses structured scoring JSON, logs to W&B Weave.

json_guard.py

Immutability enforcer — prevents the agent from editing feature descriptions, IDs, or marking its own work as passing.
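A minimal sketch of the invariant json_guard.py enforces: the only legal mutation is flipping passes from false to true. Field names other than passes are illustrative:

```python
def guard(before, after):
    """Allow only the passes: false -> true flip; reject any other mutation."""
    if len(before) != len(after):
        raise ValueError("features were added or removed")
    for old, new in zip(before, after):
        for key in old:
            if key == "passes":
                if old[key] and not new.get(key):
                    raise ValueError("cannot un-pass a feature")
            elif old[key] != new.get(key):
                raise ValueError(f"field {key!r} was mutated")
    return after
```

Running every agent-proposed write through a check like this is what keeps the feature list an authoritative spec rather than editable scratch space.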

metrics_writer.py

Metrics accumulator — appends feature entries to public/metrics.json with token counts, timing, and scores.
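The append step could look like this sketch; the entry fields shown in the comment are assumptions based on the metrics the dashboard displays:

```python
import json
from pathlib import Path

def append_metric(path, entry):
    """Append one feature's build record to the metrics JSON file."""
    p = Path(path)
    metrics = json.loads(p.read_text()) if p.exists() else []
    # e.g. entry = {"feature": 28, "attempts": 3, "score": 9, "seconds": 419}
    metrics.append(entry)
    p.write_text(json.dumps(metrics, indent=2))
```

Because the file is plain JSON served from public/, the dashboard can chart its own build history with no database behind it.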

PROMPT_build.md

System prompt for Codex — 5 phases: check feedback, pick feature, build, verify build, exit. Strict rules against placeholders and scope creep.

Key Design Decisions

JSON over Markdown for feature specs

Models treat Markdown as prose and 'helpfully' rewrite, merge, or delete features. JSON is treated as data — schema constraints are respected.

Separate evaluator context

A coding agent asked to self-evaluate says 'looks great!' and moves on. A separate model with only the diff and spec catches real issues.

Don't revert on build failure

An earlier version ran git checkout on build errors, deleting the agent's work. The fix keeps the code and feeds actual compiler errors to the next attempt.

One feature per loop iteration

Ralph Rule #1 — narrow scope prevents the agent from doing half of five things instead of all of one thing.

Git as recovery mechanism

Every passing feature is a clean commit. If the agent breaks something, git revert to the last known good state.

Failure Recovery

Real failures that happened during this build and how the harness responded.

Feature #3: Dark Mode Toggle — SKIPPED after 3 attempts

SKIPPED

1. Attempt 1: Codex wrote ThemeToggle component + added import to layout.tsx

2. Build gate FAILED: TypeScript error — ThemeToggle import path wrong

3. Original harness bug: git checkout . wiped ALL files including the component Codex just wrote

4. Attempt 2: Codex rewrote from scratch, same import error — component deleted again on failure

5. Attempt 3: Same pattern. Max attempts reached → SKIPPED

Root Cause

The harness ran git checkout . && git clean -fd on build failure, nuking the agent's work before the retry could fix it. The agent kept recreating the component, but the revert kept deleting it.

Resolution

Removed the destructive revert. New behavior: keep the code on build failure, feed actual compiler errors into feedback.md so the next attempt can fix the specific issue instead of starting over.

Feature #1: Scaffold — scored 5/10 on first attempt

RECOVERED

1. Attempt 1: Codex scaffolded Next.js app but missed dark background, no Tailwind classes applied

2. Evaluator scored 5/10 — 'Homepage renders but missing dark background styling'

3. Feedback written to .ralph-logs/feedback.md with specific fix

4. Attempt 2: Codex read feedback, added bg-zinc-950 and Tailwind classes

5. Evaluator scored 9/10 → PASSED

Root Cause

First attempt was functional but visually incomplete. The evaluator caught what a self-evaluating agent would have marked as 'done.'

Resolution

This is the system working as designed — evaluator backpressure caught a quality issue and the builder fixed it on retry.

Feature #28: Loading Skeletons — 3 attempts, 419 seconds

RECOVERED

1. Attempt 1: Skeleton components exist but no pulse animation, wrong dimensions (4/10)

2. Attempt 2: Animation works but skeleton heights don't match actual content sections (6/10)

3. Attempt 3: All dimensions correct, pulse animation smooth, matches real content layout (9/10) → PASSED

Root Cause

Complex feature requiring pixel-level accuracy. Each evaluator pass caught progressively finer issues.

Resolution

Inner loop did exactly what it should: iterated from broken → functional → polished. Three attempts, each building on the previous.

Feature #13: Dual-Axis Chart — scored 5/10 then 8/10

RECOVERED

1. Attempt 1: Both score line and iteration bars rendered on the same Y-axis scale, making iterations invisible

2. Evaluator: 'Dual y-axis not configured, both series on same scale'

3. Attempt 2: Codex added right Y-axis for iterations, left for scores, legend distinguishes them

4. Evaluator scored 8/10 → PASSED

Root Cause

Recharts dual-axis configuration is non-obvious. The evaluator caught a usability issue a human reviewer would also catch.

Resolution

Specific feedback ('dual y-axis not configured') was actionable enough for the builder to fix in one revision.

The pattern: Volume without quality = 30 half-baked features. Quality without volume = 3 perfect features. The two-loop architecture gives you both: 29 features, average score 9.5/10, with real iteration on the hard ones.

Research Foundation

Effective Harnesses for Long-Running Agents

Justin Young et al., Anthropic · November 2025

Patterns used: JSON feature list, git-as-recovery, one-feature-per-iteration, browser verification

Harness Design for Long-Running Application Development

Prithvi Rajasekaran, Anthropic Labs · March 2026

Patterns used: GAN-inspired generator-evaluator separation, multi-dimension scoring, iterative quality improvement

Autoresearch Loop (independent prior art)

Ben Shyong · March 2026

Patterns used: Separate generator + evaluator models, bounded iteration with score tracking, measurable improvement (6.42 → 6.56)