Erdős — auditable AI×math discovery

The method is
the product

Can a curious generalist and an AI co-explore mathematics — with every reasoning step recorded, replayable, and independently verifiable? Frontier labs ship answers. Nobody ships the trail. Erdős is the trail.

Calculator + Ledger · v0.1 demo · first real run filed 2026-06-10

0provenance nodes

0node kinds in schema

0confidence layers

0re-runs needed to replay

Two peers, not a pipeline

The architecture is deliberately simple: a Calculator that cannot hallucinate, and a Ledger that cannot forget.

The Calculator

Domain-agnostic, step-rigorous. Every mathematical operation routes to a deterministic tool — symbolic, formal, numerical-with-bounds, or constraint-solving. The reasoning core proposes; the tools dispose.

Lean 4 + Mathlib · SymPy · Z3 / CVC5 · Arb / MPFR

The Ledger

Records every decision, branch, dead end, and micro-hypothesis as a node in a provenance DAG. Versions every state. Outputs paper-ready: confidence layers, methodology, full audit trail.

SQLite · ULID-keyed nodes · cached replay · confidence per step

The unsolved problems — extremal combinatorics first — are the proving ground, not the point. Whether a bound moves or not, the publishable artifact is the legible record of how a human–AI pair actually reasons.

A real run, clickable

This is the Ledger's first real provenance graph — not an illustration. The toy problem is Mantel's theorem (settled since 1907, on purpose: the run exists to exercise the Ledger, not the math). Click any node to read the actual record.

Real entries only — run 01KTSVGYP55N359QWRWY916XYF · replayed consistent · zero re-runs

hypothesis tool invocation tool result verification

Select a node

One hypothesis fans out to a SymPy closed-form branch and five Z3 exhaustive branches (n = 3…7); all six results converge on a verification node. Z3's maxima and ⌊n²/4⌋ agree everywhere: 2 · 4 · 6 · 9 · 12.

The audit trail audits itself

A provenance record you can't replay is a claim, not evidence. The Ledger's replay contract is strict: reload the run from cache alone — zero re-execution — and re-derive the DAG's consistency: every parent resolves to an earlier node, and the verification node's recorded claims must match the cached tool outputs it summarizes.

On its very first invocation, the replay reader caught a real bug: two nodes minted in the same millisecond had sorted child-before-parent (random ULID bits deciding the order). The fix — monotonic ULIDs — landed mid-run, and the second run replayed clean. The replay contract's value is now demonstrated, not asserted.

ai_self_reported	The reasoning core's own probability estimate when it generated the step — recorded, not trusted.
tool_certainty	Hard categorical: `verified` (a formal tool accepted it) · `computed` · `estimated` · `n/a`.
human_endorsed	Did a human actively approve this checkpoint? true / false / not-yet-seen.
downstream_dependent	How many later verified nodes lean on this one — structural load-bearing-ness, computed from the DAG.

Four confidence layers per step, because a single "confidence number" is a lie for mathematical reasoning. Readers of the eventual methodology paper see at a glance which steps were genuinely verified and which were AI-asserted but downstream-trusted.

The first run happened in daylight

Erdős runs are autonomous night-shift sessions under a written contract — a clock, heartbeats every thirty minutes, sealed checkpoints, and hard rules about what counts as verified. The first one ran live, watched.