Self-referential metrics dashboard

Token Economics

Token usage, cost breakdown, and model roles across the build.

GPT-5.3 Codex

Builder

Writes all application code via Codex CLI in --full-auto mode. Sandboxed to workspace writes only. One feature per invocation.

Mode

codex exec --full-auto

Timeout

7 min per feature

Cost

~$0.01/1K tokens

Output

Code + tests + build

Claude Sonnet 4

Evaluator

Scores each feature on 3 dimensions via API call. Returns structured JSON with pass/fail and specific feedback per dimension.

Dimensions

Complete / Visual / No Placeholders

Weights

40% / 30% / 30%

Cost

~$0.003/1K tokens

Output

Score + feedback JSON

Why separate models? The builder (Codex) optimizes for completing the feature. The evaluator (Sonnet) optimizes for finding flaws. Same model doing both leads to “confidently praising its own mediocre work.” Different model, fresh context = honest scoring.

Token Split

Loading metrics...

Cost per Feature

Shows whether the agent becomes more efficient over time.

Loading metrics...

Running Total & Projection

Loading metrics...