codex-long-running-harness

Codex-first harness for long-running app development with sprint planning, evaluator loops, and benchmark snapshots.

codex-long-running-harness is a Codex-first development harness for tasks that are too large for a single conversational pass. It treats long-horizon coding as an iterative system problem: plan the work, execute a sprint, evaluate the output, checkpoint the result, then continue.

Why it matters

  • Long-running software work needs explicit state, not just longer prompts.
  • Evaluation has to be part of the loop; otherwise the system drifts silently.
  • Progress should be inspectable after each sprint, with artifacts that survive restarts.

Core workflow

goal -> planner -> sprint backlog -> generator -> evaluator -> snapshot -> next sprint
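
As a rough illustration of the loop above, here is a minimal Python sketch. It is not the repository's actual code: `run_harness`, `HarnessState`, `SprintResult`, and the `planner`/`generator`/`evaluator` callables are hypothetical names chosen for this example, and the retry policy (a failed task stays at the head of the backlog) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SprintResult:
    """One snapshot entry: what was attempted in a sprint and whether it passed."""
    sprint: int
    task: str
    output: str
    passed: bool


@dataclass
class HarnessState:
    """Explicit state carried between sprints (hypothetical shape)."""
    goal: str
    backlog: List[str]
    snapshots: List[SprintResult] = field(default_factory=list)


def run_harness(
    goal: str,
    planner: Callable[[str], List[str]],
    generator: Callable[[str], str],
    evaluator: Callable[[str, str], bool],
    max_sprints: int = 10,
) -> HarnessState:
    """goal -> planner -> sprint backlog -> generator -> evaluator -> snapshot -> next sprint."""
    state = HarnessState(goal=goal, backlog=list(planner(goal)))
    sprint = 0
    while state.backlog and sprint < max_sprints:
        sprint += 1
        task = state.backlog[0]
        output = generator(task)          # execute the sprint
        passed = evaluator(task, output)  # quality gate inside the loop
        state.snapshots.append(SprintResult(sprint, task, output, passed))
        if passed:
            state.backlog.pop(0)          # only passing work leaves the backlog
        # Assumption: a failed task stays at the head and is retried next sprint.
    return state
```

In this sketch every sprint appends a snapshot whether or not it passes, so failed attempts remain visible afterwards, and `max_sprints` bounds the loop if a task never clears the evaluator.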

What this project emphasizes

  • Sprint-based execution: large coding goals are decomposed into bounded work cycles.
  • Evaluator-driven iteration: quality gates are part of the runtime, not an afterthought.
  • Recoverability: intermediate results and benchmark snapshots make it possible to resume work instead of restarting from scratch.
  • Research-grade traceability: each iteration can be compared, audited, and improved systematically.
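
The recoverability point can be sketched as a pair of helpers that persist state after each sprint and reload it on restart. This is an assumption-laden example, not the project's real snapshot format: the JSON layout, the file name, and the `save_snapshot`/`load_snapshot` functions are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_snapshot(path, state):
    """Write the state dict to `path` atomically (temp file, then rename)."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
    # os.replace swaps the file in one step, so a crash mid-write
    # leaves the previous snapshot intact.
    os.replace(tmp_name, path)


def load_snapshot(path, default):
    """Return the saved state, or `default` when no snapshot exists yet."""
    path = Path(path)
    if not path.exists():
        return default
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

A harness built this way could call `load_snapshot` at startup and `save_snapshot` at the end of each sprint, so a restart resumes from the last completed sprint instead of from the original goal.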

Why it is highlighted here

This repo is a stronger signal than an ordinary agent demo because it focuses on the real bottleneck: keeping an AI coding system working across many sprints without becoming opaque or fragile.