Can a curious generalist and an AI co-explore mathematics — with every reasoning step recorded, replayable, and independently verifiable? Frontier labs ship answers. Nobody ships the trail. Erdős is the trail.
Calculator + Ledger · v0.1 demo · first real run filed 2026-06-10
The architecture is deliberately simple: a Calculator that cannot hallucinate, and a Ledger that cannot forget.
Domain-agnostic, step-rigorous. Every mathematical operation routes to a deterministic tool — symbolic, formal, numerical-with-bounds, or constraint-solving. The reasoning core proposes; the tools dispose.
Records every decision, branch, dead end, and micro-hypothesis as a node in a provenance DAG. Versions every state. Outputs paper-ready: confidence layers, methodology, full audit trail.
The unsolved problems — extremal combinatorics first — are the proving ground, not the point. Whether a bound moves or not, the publishable artifact is the legible record of how a human–AI pair actually reasons.
This is the Ledger's first real provenance graph — not an illustration. The toy problem is Mantel's theorem (settled since 1907, on purpose: the run exists to exercise the Ledger, not the math). Click any node to read the actual record.
Real entries only — run 01KTSVGYP55N359QWRWY916XYF · replayed consistent · zero re-runsA provenance record you can't replay is a claim, not evidence. The Ledger's replay contract is strict: reload the run from cache alone — zero re-execution — and re-derive the DAG's consistency: every parent resolves to an earlier node, and the verification node's recorded claims must match the cached tool outputs it summarizes.
On its very first invocation, the replay reader caught a real bug: two nodes minted in the same millisecond had sorted child-before-parent (random ULID bits deciding the order). The fix — monotonic ULIDs — landed mid-run, and the second run replayed clean. The replay contract's value is now demonstrated, not asserted.
| ai_self_reported | The reasoning core's own probability estimate when it generated the step — recorded, not trusted. |
| tool_certainty | Hard categorical: verified (a formal tool accepted it) · computed · estimated · n/a. |
| human_endorsed | Did a human actively approve this checkpoint? true / false / not-yet-seen. |
| downstream_dependent | How many later verified nodes lean on this one — structural load-bearing-ness, computed from the DAG. |
Four confidence layers per step, because a single "confidence number" is a lie for mathematical reasoning. Readers of the eventual methodology paper see at a glance which steps were genuinely verified and which were AI-asserted but downstream-trusted.
Erdős runs are autonomous night-shift sessions under a written contract — a clock, heartbeats every thirty minutes, sealed checkpoints, and hard rules about what counts as verified. The first one ran live, watched.
Wrapped clean — nothing lost
Lean 4 verification nodes (the truth oracle), checkpoint/resume for overnight batches, and then the real proving ground: published extremal-combinatorics problems with loose bounds and clean verifiers. Improved bound or documented dead end — both are wins when the trail is the product.
The single biggest known failure mode of this project is the solo trap. If you're a combinatorialist — PhD student is plenty — curious what auditable human–AI collaboration looks like from the inside, the lab wants to talk. The Ledger means you can check every step we claim. brokenbranch.dev has the contact trail.
Labs ship answers. This one ships the trail. ❦