Brokenbranch Lab — Erdős

The method is
the product

Can a curious generalist and an AI co-explore mathematics — with every reasoning step recorded, replayable, and independently verifiable? Frontier labs ship answers. Nobody ships the trail. Erdős is the trail.

Calculator + Ledger · v0.1 demo · first real run filed 2026-06-10

0provenance nodes
0node kinds in schema
0confidence layers
0re-runs needed to replay
01

Two peers, not a pipeline

The architecture is deliberately simple: a Calculator that cannot hallucinate, and a Ledger that cannot forget.

The Calculator

Domain-agnostic, step-rigorous. Every mathematical operation routes to a deterministic tool — symbolic, formal, numerical-with-bounds, or constraint-solving. The reasoning core proposes; the tools dispose.

Lean 4 + Mathlib · SymPy · Z3 / CVC5 · Arb / MPFR

The Ledger

Records every decision, branch, dead end, and micro-hypothesis as a node in a provenance DAG. Versions every state. Outputs paper-ready: confidence layers, methodology, full audit trail.

SQLite · ULID-keyed nodes · cached replay · confidence per step

The unsolved problems — extremal combinatorics first — are the proving ground, not the point. Whether a bound moves or not, the publishable artifact is the legible record of how a human–AI pair actually reasons.

02

A real run, clickable

This is the Ledger's first real provenance graph — not an illustration. The toy problem is Mantel's theorem (settled since 1907, on purpose: the run exists to exercise the Ledger, not the math). Click any node to read the actual record.

Real entries only — run 01KTSVGYP55N359QWRWY916XYF · replayed consistent · zero re-runs
hypothesis tool invocation tool result verification
Select a node
One hypothesis fans out to a SymPy closed-form branch and five Z3 exhaustive branches (n = 3…7); all six results converge on a verification node. Z3's maxima and ⌊n²/4⌋ agree everywhere: 2 · 4 · 6 · 9 · 12.
03

The audit trail audits itself

A provenance record you can't replay is a claim, not evidence. The Ledger's replay contract is strict: reload the run from cache alone — zero re-execution — and re-derive the DAG's consistency: every parent resolves to an earlier node, and the verification node's recorded claims must match the cached tool outputs it summarizes.

On its very first invocation, the replay reader caught a real bug: two nodes minted in the same millisecond had sorted child-before-parent (random ULID bits deciding the order). The fix — monotonic ULIDs — landed mid-run, and the second run replayed clean. The replay contract's value is now demonstrated, not asserted.

ai_self_reportedThe reasoning core's own probability estimate when it generated the step — recorded, not trusted.
tool_certaintyHard categorical: verified (a formal tool accepted it) · computed · estimated · n/a.
human_endorsedDid a human actively approve this checkpoint? true / false / not-yet-seen.
downstream_dependentHow many later verified nodes lean on this one — structural load-bearing-ness, computed from the DAG.

Four confidence layers per step, because a single "confidence number" is a lie for mathematical reasoning. Readers of the eventual methodology paper see at a glance which steps were genuinely verified and which were AI-asserted but downstream-trusted.

04

The first run happened in daylight

Erdős runs are autonomous night-shift sessions under a written contract — a clock, heartbeats every thirty minutes, sealed checkpoints, and hard rules about what counts as verified. The first one ran live, watched.

18:41
Session armed under the contract; scope: Ledger writer, first toy run, replay reader.
18:55
Ledger writer + first provenance DAG landed. toy/mantel-v0: PASS.
19:00
Replay reader landed — and immediately caught the ULID ordering bug from cache. Fixed; replay consistent, exit 0.
19:05
The workstation's memory-pressure guard fired: 0.4 GB of commit headroom left. The same condition once killed a 3 AM run with no warning; this time the contract caught it.
19:15
Clean early wrap: stretch goals gated as unverified rather than claimed, everything sealed and committed. Written ≠ run; only a printed exit code counts.

Wrapped clean — nothing lost

05

What's next

Lean 4 verification nodes (the truth oracle), checkpoint/resume for overnight batches, and then the real proving ground: published extremal-combinatorics problems with loose bounds and clean verifiers. Improved bound or documented dead end — both are wins when the trail is the product.

Open seat — mathematician collaborator

The single biggest known failure mode of this project is the solo trap. If you're a combinatorialist — PhD student is plenty — curious what auditable human–AI collaboration looks like from the inside, the lab wants to talk. The Ledger means you can check every step we claim. brokenbranch.dev has the contact trail.

Labs ship answers. This one ships the trail. ❦