A Second Foundation
Architecture redesignJune 6, 2026

From One Formula to a System of Models

For 41 sessions the project polished a single equation it could never actually run. We rebuilt it the way real forecasting sciences work: a population of competing, runnable models on a leaderboard, scored by a frozen oracle, driven by an autonomous research loop. This is the operationalization of Asimov's idea — psychohistory as a mathematical system, not one closed-form solution.

The honest question every report ended on, that 41 sessions never answered: what is the formula's actual forecast skill? Now we measure it.

Why we were stuck

A research engine that never computed could only accumulate citations

A ground-truth audit of the repository plus a 10-report research review converged on the same conclusion: the stagnation was architectural, not a matter of effort. An autonomous discovery loop needs four legs. The old design had none of them.

01

No population

One living formula, refined by consensus.

There was exactly one CURRENT.md. Every session re-polished the same document — gradient descent from a single seed. Productive discovery needs a population of competing candidates (FunSearch, AlphaEvolve, ShinkaEvolve). With one model, the diversity term that powers every ensemble is zero by construction.

02

No scalar oracle

The gate was prose, not a number.

A formula update was approved by the Philosopher's written verdict. Prose cannot detect overfitting or compute calibration — and it selects for coherence with the existing formula, filtering out exactly the diverse variants an ensemble needs. Every serious system gates on an immutable scalar an agent cannot edit.

03

No execution

Agents read papers; they never ran code.

All nine lead agents and their 135 sub-agents only searched and scored literature. None fit a model, ran a simulation, or executed a backtest. The 36 'calibrated' parameters were argued into existence from papers, never fit to data — curve-fitting by citation.

04

No held-out data

Backtests were circular and qualitative.

Historical 'tests' were narrative precondition assessments scored PASS/PARTIAL, with the calibration cases reused as validation cases. No locked hold-out, no numeric score, no negative controls. After 40 sessions the state vector and solver were still undefined — so the formula had never produced a single numerical prediction.

The forks we chose

Six decisions that define the new system

Scope

Clean-slate engine, keep the corpus

Rebuild the session engine around the model zoo; archive the single-formula apparatus; reuse the 74 accumulated parameters as priors.

Mathematics

A portfolio of tractable models now

Runnable models that can be backtested today (hazard models, regime logits, ABMs, ensembles). The grand unified PDE becomes one variant and a non-blocking long-horizon research track.

Compute

Agents that run code

A new tier of compute agents fit models, run simulations, and execute backtests every session. Literature research becomes an input, not the deliverable.

Targets

Dense, scoreable benchmarks

UCDP, V-Dem, PITF, the Cline Center coup data, Seshat, and IMF feed continuous, numerically-scored feedback. Polymarket becomes one signal among several.

Publication

Publish pre-resolution, labeled

Predictions are hash-locked before they are published, then shown live tagged by reflexivity class. The clean validation signal is held-out retrodiction; live markets are secondary.

Cadence

Nightly autonomous loop

A scheduled ratchet runs many fast backtest experiments each night: keep what improves the score, revert what does not.

The model zoo

The formula is now a population of competing structures

Each variant is a genuinely different structural hypothesis with a runnable model — not a parameter tweak. They are scored identically by the frozen oracle and combined into an ensemble. The mandatory null baseline is the bar every variant must beat by discrimination.

null_baselinelive

null

Reference-class base rate only. The mandatory benchmark — beat it or you are noise.

pitf_logitlive

regime logit

The Political Instability Task Force regime-type inverted-U: anocracies are most unstable.

scaffold — awaiting real features

train_freqlive

empirical

Smoothed frequency learned from training events — a control that exposes data contamination.

sdt_turchinplanned

structural-demographic

Turchin's Political Stress Index and secular cycles.

rfim_electionplanned

sociophysics

Random-field Ising model — the best-validated sociophysics result.

fokker_planckplanned

stochastic PDE

The demoted legacy 8-D formula — now one competitor among many.

network_cascadeplanned

network

Watts threshold cascades on empirical temporal networks.

The autonomous research loop

Hypothesize, implement, backtest, score, select

Karpathy's ratchet — mutable artifact, immutable scalar metric, keep-if-better, revert-if-not — generalized from one artifact to a whole population, with explicit safeguards against premature convergence. The pure-compute version is live: it runs nightly at 01:00, tuning parameters against training cross-validation. The agentic version, which invents genuinely new models, runs in /start sessions.

1

Select

A UCB1 bandit allocates the next session to the variant or discipline where skill is improving fastest (replacing the old 306 KB priority queue).

2

Mutate

A generator proposes a parameter change, a structural change, or a crossover between two variants' mechanisms.

3

Implement

A compute agent writes and runs the code — fits, simulates, backtests — producing real numbers.

4

Score

The frozen scorer returns one composite scalar on the locked hold-out. No agent can edit it.

5

Ratchet

A novelty filter and the Philosopher's falsifiability gate decide admission; keep the variant if the score improves, otherwise revert via git.

Anti-degeneracy safeguards

Novelty filter (reject near-duplicate variants)Island model (preserve diversity)Purged + embargoed scoring (no temporal leakage)Proposer ≠ evaluator (no gaming your own metric)PBO / deflated metrics (multiple-testing honesty)

The frozen oracle

One number no agent can edit

The single evaluation function for the whole project. Models are selected on this number and nothing else — because if an agent can edit the scorer, the scorer is worthless.

Integrity rules

  • Frozen and hash-verified: integrity is checked before any score is trusted.
  • The hold-out set is proposer-forbidden — only the scorer reads the outcomes.
  • All out-of-sample evaluation uses purged + embargoed cross-validation; standard k-fold is banned.
  • Predictions are pre-registered: the probability is locked before the outcome can be read.
  • Leaderboard claims are deflated by the number of variants tried.

Live leaderboard

Where the variants actually stand

Scored by the frozen oracle on a locked hold-out of 26 historical events including 10 negative controls (high-stress societies that did NOT collapse). The leaderboard is refreshed by the autonomous loop, which runs nightly at 01:00 and tunes parameters against TRAINING cross-validation only — it never touches the hold-out. The set is deliberately crisis-skewed, so the real test is resolution (discrimination), not a low average. PBO is high (≈0.5) on this little data, so treat any sub-chance Brier as an early signal, not a robust result.

ModelFamilyBrierResolutionNeg-ctrlTier
ensembleequal-weight0.268
0.124
0.091T0
pitf_logitregime_logit0.219
0.139
0.188T1
train_freqempirical_frequency0.252
0.112
0.112T0
null_baselinenull0.370
0.095
0.038T0
Hold-out events: 26Negative controls: 10Legacy formula: Tier 0Chance line: 0.25Market line: 0.18PBO: 0.54

pitf_logit is the only variant under the chance line so far (Brier 0.22, best discrimination at resolution 0.14) — and it does that on regime type alone, before its four real PITF features are even wired in. The ensemble sits just above chance (0.27, Tier 0). The nightly loop tunes parameters against training cross-validation with a regularizer that keeps it near literature priors rather than chasing overfit extremes, so the numbers stay honest. PBO ≈ 0.5 on 26 events: promising, not yet robust. Real per-event features are the next step.

The progress ladder

What counts as real progress

A single objective metric, tracked every night: ensemble Brier on the frozen hold-out, against three reference lines. The legacy formula sits at Tier 0 — it cannot emit a probability at all. One variant (pitf_logit) already dips under the chance line, but the ensemble is still Tier 0 and PBO is high — so we don't yet claim robust Tier-1 skill, and we say so.

Tier 0

Produces a numeric prediction at all

legacy formula is here (cannot)

Tier 1

Beats chance

Brier < 0.25

current target

Tier 2

Beats market consensus

Brier < 0.18

future

Tier 3

Approaches superforecaster level

Brier < 0.15

genuine psychohistory progress

The research team, restructured

From 135 readers to a lean team that computes

Evidence is clear that beyond a small panel, more agents degrade results. The old roster of 9 leads × 15 prose-only sub-agents is replaced by a lean two-class structure: generators who propose, and compute agents who run code. The Philosopher comes off the numeric gate and becomes the anti-self-deception auditor.

Class A — Generators

Lean research roles that retrieve literature, propose parameters and structural changes, and stress-test falsifiability. They feed the loop; they no longer ARE the loop.

Class B — Compute agents

The missing half: agents whose output is a number. fit-agent fits parameters to data, abm-agent runs agent-based simulations, macro-agent fits structural-demographic models, score-agent runs the frozen scorer (and cannot propose changes).

Lead agentVerdictWhat changes
Statistical PhysicistKEEP + computeOwns the model-zoo catalog and now writes the model code, not prose.
Bayesian StatisticianPROMOTELoop owner: owns the scorer, ensemble weights, and calibration.
CliodynamicistKEEP + computeOwns the historical hold-out suite and ground-truth datasets.
Political ScientistKEEP (thin)Owns the PITF baseline variant that others must beat.
EconophysicistMERGE → MacroOwns the random-field Ising and power-law variants.
Network ScientistMERGE → MesoOne Meso team with the sociologist; owns cascade dynamics.
Computational SociologistMERGE → MesoBecomes the agent-based-model owner.
Behavioral NeuroscientistCONSULTMicro parameters have the lowest marginal value; consult, don't lead.
Evolutionary PsychologistCONSULTConstants rarely move a macro forecast; consult, don't lead.
Philosopher of ScienceRE-SCOPEOff the numeric gate; now the falsifiability + reflexivity + overfitting auditor.

Validation, rebuilt

How we keep ourselves honest

A locked hold-out

Twenty-plus events including negative controls, which the model-building agents are forbidden to read.

Numeric Brier, not PASS/FAIL

Every retrodiction emits a probability and is scored. Narrative precondition assessments are gone.

Leakage-safe backtesting

Purged k-fold with an embargo longer than the cycle being modeled; standard k-fold is banned.

Pre-registration

The probability is hash-locked before the outcome can be read — technical, not procedural, honesty.

Severe testing

Every variant admitted to the zoo carries a pre-specified falsification criterion (Mayo severity ≥ 0.8).

Reflexivity audit

Each published prediction is classified immune, self-fulfilling, or self-defeating — the Seldon problem, handled explicitly.

What's next

Feature-fetching is the critical path to Tier 1

No variant can reach Tier 1 by discrimination until compute agents fetch real per-event features from V-Dem, the Cline Center, UCDP, and Seshat. pitf_logit already shows the best resolution on regime type alone — wiring in its four real features (infant mortality, regime durability, factionalism, neighboring conflict) is the highest-value next move. Then the nightly loop and the full agent restructuring come online.