For 41 sessions the project polished a single equation it could never actually run. We rebuilt it the way real forecasting sciences work: a population of competing, runnable models on a leaderboard, scored by a frozen oracle, driven by an autonomous research loop. This is the operationalization of Asimov's idea — psychohistory as a mathematical system, not one closed-form solution.
Every report ended on the same honest question, and 41 sessions never answered it: what is the formula's actual forecast skill? Now we measure it.
Why we were stuck
A ground-truth audit of the repository plus a 10-report research review converged on the same conclusion: the stagnation was architectural, not a matter of effort. An autonomous discovery loop needs four legs. The old design had none of them.
One living formula, refined by consensus.
There was exactly one CURRENT.md. Every session re-polished the same document — gradient descent from a single seed. Productive discovery needs a population of competing candidates (FunSearch, AlphaEvolve, ShinkaEvolve). With one model, the diversity term that powers every ensemble is zero by construction.
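The "diversity term" can be made explicit with the standard ambiguity decomposition for squared error (Krogh and Vedelsby). The notation is an assumption introduced here, not taken from the audit: $f_i$ are the individual model forecasts, $w_i \ge 0$ with $\sum_i w_i = 1$ are ensemble weights, $\bar f = \sum_i w_i f_i$ is the ensemble forecast, and $y$ is the outcome.

$$
(\bar f - y)^2 \;=\; \sum_i w_i\,(f_i - y)^2 \;-\; \sum_i w_i\,(f_i - \bar f)^2
$$

The first sum is the weighted average error of the members; the second is their spread around the ensemble. With a single model, $f_1 = \bar f$, so the second sum is identically zero and the ensemble can never do better than its only member.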
The gate was prose, not a number.
A formula update was approved by the Philosopher's written verdict. Prose cannot detect overfitting or compute calibration — and it selects for coherence with the existing formula, filtering out exactly the diverse variants an ensemble needs. Every serious system gates on an immutable scalar an agent cannot edit.
Agents read papers; they never ran code.
All nine lead agents and their 135 sub-agents only searched and scored literature. None fit a model, ran a simulation, or executed a backtest. The 36 'calibrated' parameters were argued into existence from papers, never fit to data — curve-fitting by citation.
Backtests were circular and qualitative.
Historical 'tests' were narrative precondition assessments scored PASS/PARTIAL, with the calibration cases reused as validation cases. No locked hold-out, no numeric score, no negative controls. After 40 sessions the state vector and solver were still undefined — so the formula had never produced a single numerical prediction.
The forks we chose
Scope
Clean-slate engine, keep the corpus
Rebuild the session engine around the model zoo; archive the single-formula apparatus; reuse the 74 accumulated parameters as priors.
Mathematics
A portfolio of tractable models now
Runnable models that can be backtested today (hazard models, regime logits, ABMs, ensembles). The grand unified PDE becomes one variant and a non-blocking long-horizon research track.
Compute
Agents that run code
A new tier of compute agents fit models, run simulations, and execute backtests every session. Literature research becomes an input, not the deliverable.
Targets
Dense, scoreable benchmarks
UCDP, V-Dem, PITF, the Cline Center coup data, Seshat, and IMF feed continuous, numerically-scored feedback. Polymarket becomes one signal among several.
Publication
Publish pre-resolution, labeled
Predictions are hash-locked before they are published, then shown live tagged by reflexivity class. The clean validation signal is held-out retrodiction; live markets are secondary.
Cadence
Nightly autonomous loop
A scheduled ratchet runs many fast backtest experiments each night: keep what improves the score, revert what does not.
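The keep-or-revert rule can be sketched in a few lines. This is a toy illustration under stated assumptions, not the project's loop: `ratchet`, `jitter`, and `toy_score` are hypothetical names, the score is a stand-in for a cross-validated metric where lower is better, and "revert" is modeled by discarding the candidate in memory rather than through version control.

```python
import random
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def ratchet(
    params: Params,
    score_fn: Callable[[Params], float],
    mutate_fn: Callable[[Params, random.Random], Params],
    n_experiments: int,
    seed: int = 0,
) -> Tuple[Params, float]:
    """Run n_experiments mutations; keep a candidate only if it scores lower."""
    rng = random.Random(seed)
    best, best_score = dict(params), score_fn(params)
    for _ in range(n_experiments):
        candidate = mutate_fn(best, rng)
        candidate_score = score_fn(candidate)
        if candidate_score < best_score:
            # Keep: the candidate becomes the new baseline.
            best, best_score = candidate, candidate_score
        # Otherwise revert: the candidate is simply dropped.
    return best, best_score


def jitter(params: Params, rng: random.Random) -> Params:
    """Hypothetical mutation: nudge one randomly chosen parameter."""
    key = rng.choice(sorted(params))
    out = dict(params)
    out[key] += rng.gauss(0.0, 0.1)
    return out


def toy_score(params: Params) -> float:
    """Stand-in objective with its minimum at a=1, b=-2 (lower is better)."""
    return (params["a"] - 1.0) ** 2 + (params["b"] + 2.0) ** 2


start = {"a": 0.0, "b": 0.0}
tuned, tuned_score = ratchet(start, toy_score, jitter, n_experiments=500)
```

Because a candidate is only ever accepted when it improves the score, the tracked score can never get worse from one experiment to the next. That one-way property is what makes the loop a ratchet.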
The model zoo
Each variant is a genuinely different structural hypothesis with a runnable model, not a parameter tweak. They are scored identically by the frozen oracle and combined into an ensemble. The mandatory null baseline is the bar every variant must beat on discrimination.
null
Reference-class base rate only. The mandatory benchmark — beat it or you are noise.
regime logit
The Political Instability Task Force regime-type inverted-U: anocracies are most unstable. Status: scaffold, awaiting real features.
empirical
Smoothed frequency learned from training events — a control that exposes data contamination.
structural-demographic
Turchin's Political Stress Index and secular cycles.
sociophysics
Random-field Ising model — the best-validated sociophysics result.
stochastic PDE
The demoted legacy 8-D formula — now one competitor among many.
network
Watts threshold cascades on empirical temporal networks.
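To make "runnable model" concrete, here is a minimal sketch of the last variant in the list, a Watts threshold cascade. It makes simplifying assumptions that are not from the project: the graph is a small static toy network rather than an empirical temporal one, and `watts_cascade` and the node names are hypothetical. A node activates once the active fraction of its neighbors reaches its threshold, and stays active.

```python
from typing import Dict, List, Set


def watts_cascade(
    neighbors: Dict[str, List[str]],
    thresholds: Dict[str, float],
    seeds: Set[str],
) -> Set[str]:
    """Return the final active set of a threshold cascade started from seeds."""
    active = set(seeds)
    changed = True
    while changed:  # sweep until no node flips in a full pass
        changed = False
        for node, nbrs in neighbors.items():
            if node in active or not nbrs:
                continue
            active_fraction = sum(n in active for n in nbrs) / len(nbrs)
            if active_fraction >= thresholds[node]:
                active.add(node)
                changed = True
    return active


# Toy chain a - b - c - d - e.
chain = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
low = watts_cascade(chain, {n: 0.5 for n in chain}, {"a"})   # spreads to every node
high = watts_cascade(chain, {n: 0.6 for n in chain}, {"a"})  # stalls at the seed
```

Activation is monotone (nodes never deactivate), so the final set does not depend on the order in which nodes are visited. The two runs show the model's characteristic behavior: a small change in threshold separates a global cascade from none at all.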
The autonomous research loop
Karpathy's ratchet — mutable artifact, immutable scalar metric, keep-if-better, revert-if-not — generalized from one artifact to a whole population, with explicit safeguards against premature convergence. The pure-compute version is live: it runs nightly at 01:00, tuning parameters against training cross-validation. The agentic version, which invents genuinely new models, runs in /start sessions.
Select
A UCB1 bandit allocates the next session to the variant or discipline where skill is improving fastest (replacing the old 306 KB priority queue).
Mutate
A generator proposes a parameter change, a structural change, or a crossover between two variants' mechanisms.
Implement
A compute agent writes and runs the code — fits, simulates, backtests — producing real numbers.
Score
The frozen scorer returns one composite scalar on the locked hold-out. No agent can edit it.
Ratchet
A novelty filter and the Philosopher's falsifiability gate decide admission; keep the variant if the score improves, otherwise revert via git.
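The Select step can be illustrated with the textbook UCB1 rule: try every arm once, then pick the arm with the highest mean reward plus an exploration bonus of sqrt(2 ln N / n). This is a sketch under assumptions, not the project's allocator: `ucb1_select` is a hypothetical name, each "arm" is a variant, and the reward is assumed to be a per-session skill improvement scaled to [0, 1]. The reward numbers are made up for illustration.

```python
import math
from typing import Dict, List


def ucb1_select(rewards: Dict[str, List[float]]) -> str:
    """Pick the next arm given each arm's reward history (rewards in [0, 1])."""
    # Any arm that has never been tried is selected first.
    for arm in sorted(rewards):
        if not rewards[arm]:
            return arm

    total_pulls = sum(len(r) for r in rewards.values())

    def upper_bound(arm: str) -> float:
        history = rewards[arm]
        mean = sum(history) / len(history)
        bonus = math.sqrt(2.0 * math.log(total_pulls) / len(history))
        return mean + bonus

    return max(sorted(rewards), key=upper_bound)


# Illustrative (invented) reward histories for three variants.
history = {
    "pitf_logit": [0.6, 0.5, 0.7],
    "network": [0.2],
    "sociophysics": [],
}
first_pick = ucb1_select(history)   # untried arm goes first
history["sociophysics"].append(0.1)
second_pick = ucb1_select(history)  # exploration bonus favors a rarely tried arm
```

In the second call the best-performing arm is not chosen: arms with only one session get a larger bonus, so the bandit keeps exploring until the evidence is less lopsided. That behavior is one guard against the population converging early on a single variant.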
Anti-degeneracy safeguards
The frozen oracle
The single evaluation function for the whole project. Models are selected on this number and nothing else — because if an agent can edit the scorer, the scorer is worthless.
Integrity rules
Live leaderboard
Scored by the frozen oracle on a locked hold-out of 26 historical events, including 10 negative controls (high-stress societies that did NOT collapse). The leaderboard is refreshed by the autonomous loop, which runs nightly at 01:00 and tunes parameters against TRAINING cross-validation only; it never touches the hold-out. The set is deliberately crisis-skewed, so the real test is resolution (discrimination), not a low average Brier score. The probability of backtest overfitting (PBO) is high (≈0.5) on this little data, so treat any sub-chance Brier (below 0.25) as an early signal, not a robust result.
| Model | Family | Brier | Resolution | Neg-ctrl | Tier |
|---|---|---|---|---|---|
| ensemble | equal-weight | 0.268 | 0.124 | 0.091 | T0 |
| pitf_logit | regime_logit | 0.219 | 0.139 | 0.188 | T1 |
| train_freq | empirical_frequency | 0.252 | 0.112 | 0.112 | T0 |
| null_baseline | null | 0.370 | 0.095 | 0.038 | T0 |
pitf_logit is the only variant under the chance line so far (Brier 0.22, best discrimination at resolution 0.14) — and it does that on regime type alone, before its four real PITF features are even wired in. The ensemble sits just above chance (0.27, Tier 0). The nightly loop tunes parameters against training cross-validation with a regularizer that keeps it near literature priors rather than chasing overfit extremes, so the numbers stay honest. PBO ≈ 0.5 on 26 events: promising, not yet robust. Real per-event features are the next step.
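The two headline columns can be sketched as follows. This is a minimal illustration, not the project's frozen scorer: the function names and toy forecasts are hypothetical, and "resolution" here is the resolution term of the standard Murphy decomposition (Brier = reliability - resolution + uncertainty), which may differ from how the leaderboard computes its Resolution column. The example also shows where the 0.25 chance line comes from, assuming "chance" means a constant 0.5 forecast.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def brier(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def murphy_decomposition(
    probs: Sequence[float], outcomes: Sequence[int]
) -> Dict[str, float]:
    """Split the Brier score into reliability, resolution, and uncertainty.

    Forecasts are grouped by their exact value, so the identity
    brier = reliability - resolution + uncertainty holds exactly.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n
    by_forecast: Dict[float, List[int]] = defaultdict(list)
    for p, o in zip(probs, outcomes):
        by_forecast[p].append(o)

    reliability = 0.0
    resolution = 0.0
    for p, observed in by_forecast.items():
        hit_rate = sum(observed) / len(observed)
        reliability += len(observed) * (p - hit_rate) ** 2
        resolution += len(observed) * (hit_rate - base_rate) ** 2

    return {
        "reliability": reliability / n,
        "resolution": resolution / n,
        "uncertainty": base_rate * (1.0 - base_rate),
    }


# Toy data: 1 = crisis occurred, 0 = negative control.
outcomes = [1, 1, 1, 0, 1, 0, 0, 1]
coin_flip = [0.5] * len(outcomes)              # always says 50%
sharp = [0.8 if o else 0.2 for o in outcomes]  # separates the two classes

coin_brier = brier(coin_flip, outcomes)
coin_parts = murphy_decomposition(coin_flip, outcomes)
sharp_brier = brier(sharp, outcomes)
sharp_parts = murphy_decomposition(sharp, outcomes)
```

A constant 0.5 forecast scores 0.25 whatever the outcomes are, and its resolution is zero because it never distinguishes one event from another. On a crisis-skewed set a model can lower its Brier score just by leaning toward "crisis", which is why resolution is the more informative column.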
The progress ladder
A single objective metric, tracked every night: ensemble Brier on the frozen hold-out, against three reference lines. The legacy formula sits at Tier 0 — it cannot emit a probability at all. One variant (pitf_logit) already dips under the chance line, but the ensemble is still Tier 0 and PBO is high — so we don't yet claim robust Tier-1 skill, and we say so.
| Rung | Threshold | Status |
|---|---|---|
| Produces a numeric prediction at all | n/a | legacy formula is here (cannot) |
| Beats chance | Brier < 0.25 | current target |
| Beats market consensus | Brier < 0.18 | future |
| Approaches superforecaster level | Brier < 0.15 | genuine psychohistory progress |
The research team, restructured
Evidence is clear that beyond a small panel, more agents degrade results. The old roster of 9 leads × 15 prose-only sub-agents is replaced by a lean two-class structure: generators who propose, and compute agents who run code. The Philosopher comes off the numeric gate and becomes the anti-self-deception auditor.
Generators
Lean research roles that retrieve literature, propose parameters and structural changes, and stress-test falsifiability. They feed the loop; they no longer ARE the loop.
Compute agents
The missing half: agents whose output is a number. fit-agent fits parameters to data, abm-agent runs agent-based simulations, macro-agent fits structural-demographic models, and score-agent runs the frozen scorer (and cannot propose changes).
| Lead agent | Verdict | What changes |
|---|---|---|
| Statistical Physicist | KEEP + compute | Owns the model-zoo catalog and now writes the model code, not prose. |
| Bayesian Statistician | PROMOTE | Loop owner: owns the scorer, ensemble weights, and calibration. |
| Cliodynamicist | KEEP + compute | Owns the historical hold-out suite and ground-truth datasets. |
| Political Scientist | KEEP (thin) | Owns the PITF baseline variant that others must beat. |
| Econophysicist | MERGE → Macro | Owns the random-field Ising and power-law variants. |
| Network Scientist | MERGE → Meso | One Meso team with the sociologist; owns cascade dynamics. |
| Computational Sociologist | MERGE → Meso | Becomes the agent-based-model owner. |
| Behavioral Neuroscientist | CONSULT | Micro parameters have the lowest marginal value; consult, don't lead. |
| Evolutionary Psychologist | CONSULT | Constants rarely move a macro forecast; consult, don't lead. |
| Philosopher of Science | RE-SCOPE | Off the numeric gate; now the falsifiability + reflexivity + overfitting auditor. |
Validation, rebuilt
A locked hold-out
Twenty-six historical events, 10 of them negative controls, which the model-building agents are forbidden to read.
Numeric Brier, not PASS/FAIL
Every retrodiction emits a probability and is scored. Narrative precondition assessments are gone.
Leakage-safe backtesting
Purged k-fold with an embargo longer than the cycle being modeled; standard k-fold is banned.
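A simplified version of this split can be sketched as below. It is an illustration under assumptions, not the project's backtester: `purged_kfold` is a hypothetical helper, each sample is reduced to a single timestamp, and purging and embargo are merged into one symmetric rule that drops every training sample within `embargo` time units of the test fold on either side. The years are arbitrary placeholders, not hold-out events.

```python
from typing import Iterator, List, Sequence, Tuple


def purged_kfold(
    times: Sequence[float], k: int, embargo: float
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) for k contiguous-in-time test folds.

    Training samples whose timestamp lies within `embargo` of the test
    fold's time span are excluded, so no training point sits right next
    to a test point in time.
    """
    if k < 2 or len(times) < k:
        raise ValueError("need k >= 2 and at least k samples")

    order = sorted(range(len(times)), key=lambda i: times[i])
    fold_size, remainder = divmod(len(order), k)

    start = 0
    for fold in range(k):
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = order[start:stop]
        lo = times[test[0]] - embargo
        hi = times[test[-1]] + embargo
        # Test samples fall inside [lo, hi], so this also removes them.
        train = [i for i in order if not (lo <= times[i] <= hi)]
        yield train, test
        start = stop


# Placeholder timeline: one sample every 20 years.
years = list(range(1800, 2000, 20))
splits = list(purged_kfold(years, k=5, embargo=30))
```

With a 30-year embargo the sample 20 years from a test fold is dropped while the one 40 years away is kept. Setting `embargo` longer than the cycle being modeled, as the rule above requires, prevents a model from scoring well merely because training and test points share the same phase of one cycle.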
Pre-registration
The probability is hash-locked before the outcome can be read — technical, not procedural, honesty.
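One common way to implement a hash-lock is a commit-and-reveal scheme, sketched here. The scheme and the names `commit` and `verify` are assumptions for illustration; the source does not say which construction the project uses. The digest is published before resolution, and the payload is revealed afterwards so anyone can recompute the hash.

```python
import hashlib
import json
import secrets
from typing import Tuple


def commit(prediction: dict) -> Tuple[str, str]:
    """Return (digest, payload). Publish the digest now, keep the payload private."""
    # A random nonce stops anyone brute-forcing a low-entropy prediction
    # (for example, a probability rounded to two decimals) from the digest.
    nonce = secrets.token_hex(16)
    payload = json.dumps(
        {"nonce": nonce, "prediction": prediction},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return digest, payload


def verify(digest: str, payload: str) -> bool:
    """Check that a revealed payload matches the digest published earlier."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest() == digest


# Hypothetical prediction record; the field names are illustrative.
digest, payload = commit({"event_id": "example-001", "probability": 0.31})
honest = verify(digest, payload)
tampered = verify(digest, payload.replace("0.31", "0.91"))
```

Editing the probability after the fact changes the hash, so a revised forecast cannot be passed off as the original. This is the sense in which the honesty is technical rather than procedural.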
Severe testing
Every variant admitted to the zoo carries a pre-specified falsification criterion (Mayo severity ≥ 0.8).
Reflexivity audit
Each published prediction is classified immune, self-fulfilling, or self-defeating — the Seldon problem, handled explicitly.
What's next
No variant can reach Tier 1 by discrimination until compute agents fetch real per-event features from V-Dem, the Cline Center, UCDP, and Seshat. pitf_logit already shows the best resolution on regime type alone, so wiring in its four real features (infant mortality, regime durability, factionalism, neighboring conflict) is the highest-value next move. Then the full agent restructuring comes online, alongside the nightly loop that is already running.