The Machine — Learn — A Second Foundation

The model zoo

If you ask this project for “the formula,” you will not get an equation. You will get a directory of small programs. Each one is a complete, runnable model of how large-scale political crisis happens — a coup, a revolution, a state collapse — and they compete against each other. That directory is the model zoo. As of this writing it holds 10 variants.

What a variant is

A variant is three things bound together: a runnable program that takes measurable facts about a country and a year and returns a probability; a written structural hypothesis — the one-sentence claim about how the world works that the program encodes; and an admission status that records how far that claim has survived validation. If any of the three is missing, it is not a variant — it is an opinion.

The measurable facts are called features: infant mortality, regime type, the share of young adults in the population, whether a neighboring state is at war. Every feature value must trace back to a row in a downloaded dataset. If a value is unknown, it stays unknown and the model must cope. No variant is permitted to invent its own evidence.

Why a population instead of one grand model

This project used to be one grand equation, maintained by consensus — every discipline negotiating its term into a single formula. The redesign of June 2026 abandoned that, for three reasons.

A population is cheap to kill. When a single grand model fails a test, you cannot discard it — you patch it, and every patch is an argument. When a zoo variant fails, it is simply marked failed, the failure is recorded, and the other variants keep running. Killing bad ideas quickly is most of what progress is.

A population is comparable. Every variant takes the same exam: the same sealed events, the same frozen grading function, one shared leaderboard. Two mechanisms that would otherwise be debated in prose are instead made to predict, and the predictions are scored.

Mechanisms compete instead of being averaged. In a single equation, rival theories get blended into one term by committee. In a zoo, the theory that elite overproduction drives instability and the theory that regime type drives instability each get their own program — and the world gets to referee.

What the variants actually claim

Three examples make the idea concrete. One variant knows nothing except history’s base rates: for any event it is shown, it answers with the long-run frequency of similar events in its reference class — roughly, “coups in countries like this happen about this often.” It encodes no mechanism at all. It exists to be beaten: any model that cannot outperform it has learned nothing about the world.

Another family encodes the central finding of the Political Instability Task Force, a long-running US government forecasting effort: that partially democratic, factionalized regimes — not full democracies, not hardened autocracies — are the most crisis-prone, with poor material conditions and a war next door raising the hazard further.

A third implements Peter Turchin’s structural-demographic theory: that instability builds when a society produces more elite aspirants than elite positions, when ordinary living standards erode, and when the state’s finances are under stress — pressures that accumulate over decades before they release.

The current roster, live from the leaderboard data: null_baseline, train_freq, pitf_logit, firth_logit, sdt_turchin, hierarchical_bayes, hazard_spline, reign_logit, gbm_honest, conformal_wrapper. Each variant’s full hypothesis and status is documented on the redesign page; here it is enough to see that they are genuinely different theories, not ten flavors of the same one.

Admitted vs experimental

Every variant on the board carries one of two statuses. Admitted variants belong to the official ensemble — the small set of models whose combined output is the project’s actual forecast, the number it stakes its reputation on. Experimental variants are on the leaderboard, scored every night, fully visible — and excluded. They are candidates, not members. Right now the board carries 2 admitted and 8 experimental variants.

Two rules govern the boundary, and both are stricter than you might expect. First, nothing built overnight ever enters the ensemble automatically. The autonomous pipeline can build, tune, and score — it cannot admit. Second, leaderboard rank is irrelevant to admission. As of June 2026 the variant with the best raw Brier score on the entire board — pitf_logit, at 0.1745 — is not admitted. It is falsified: its pre-registered test failed, three times, and the register says so permanently.

Why would a top score not be enough? Because a flattering average can be earned without learning anything — by hedging every answer toward the base rate, or by riding a lucky draw of test events. Chapter 3 is about exactly that failure mode. The only road to admission runs through a severe test — a check pre-registered before the evidence arrives, built so that a false mechanism would very probably fail it — and then through the Philosopher gate, an adversarial review of the whole claim.

How a model becomes official

Built

A new variant lands in the zoo — from an overnight build or a session worker.

Experimental

Listed on the leaderboard, excluded from the official ensemble. Rank changes nothing.

Severe test

A pre-registered test, written before the evidence arrives, designed to fail if the mechanism is not real.

Philosopher gate

An adversarial judge checks the test result, the data hygiene, and the structural claim.

Admitted

Joins the official ensemble — the only models whose forecasts count as ours.

↓ fail at either checkpoint

Falsified

The failure is recorded in the falsification register — append-only, never deleted. The variant stays runnable but excluded. Re-entry requires a materially different mechanism, not a retry.

Note the side exit. Falsification is not deletion: a falsified variant stays runnable and visible, but its structural claim is dead on the record, and re-entry requires a materially different mechanism — new structure, new evidence — not a second attempt at the same test.

The nightly loop

At 01:00 every night, with no human present, a scheduled pipeline runs on the project machine. Four of its five stages are pure computation — no AI, no network judgment calls. One stage is an AI build, and it is deliberately kept in a cage.

Every night at 01:00 — no operator present

Content refresh

Sync the public site’s numbers from the last settled cycle. No deploy, no judgment.

pure compute

Ratchet

Tune each model’s free parameters on training data only. A champion is replaced only by something strictly better.

pure compute

Grammar sweep

Mass-test thousands of candidate model structures against luck. Survivors are labeled hypotheses, never discoveries.

pure compute

Bounded evolve

ONE AI build per night: $3 hard cap, 45-minute timeout, tamper checks on protected files, a kill-switch file.

bounded LLM

Resolution check

Check whether registered real-world predictions have resolved; settle them on an append-only scoreboard.

pure compute

Nothing the night produces enters the official ensemble. Overnight builds wake up experimental and wait for the next operator session.

The ratchet. The ratchet tunes each variant’s free parameters — the adjustable dials inside an otherwise fixed structure. It searches using training data only; the sealed hold-out is never consulted to make a decision. And it is keep-if-better: a variant’s current champion parameters are replaced only by a candidate that is strictly superior. Like a mechanical ratchet, it can turn forward or hold — it cannot slip back.

The grammar sweep. The grammar sweep searches over model structures rather than dials: which features, which transformations, which mathematical form, in combination — thousands of candidates per run. Mass search finds lucky garbage by default, so every promising candidate is checked against a permutation null — the same search run on deliberately scrambled data, to measure what pure luck scores — and the survivors are filtered with a false-discovery-rate correction that accounts for how many things were tried. Whatever passes is labeled a hypothesis, never a discovery. In the first sweep, in June 2026, 11 of 236 candidates survived; they became entries in a queue, not claims about the world.

The bounded evolve. Once per night, the pipeline asks an AI model to build or mutate exactly one variant, drawn from that hypothesis queue. The boundedness is the design philosophy, not an afterthought: a $3 hard spending cap, a 45-minute timeout, and cryptographic hash checks on every protected file — if the agent touches the frozen scorer, the sealed data, or any other guarded file, the change is automatically reverted and the build quarantined. A single kill-switch file, when present, turns the stage off entirely. And whatever the agent builds wakes up experimental. An autonomous researcher you cannot bound is an autonomous researcher you cannot trust.

The resolution check. Finally, the pipeline checks whether any registered real-world predictions have resolved — did the forecasted event happen or not — and settles them on a scoreboard that is append-only. Wins and losses are written with the same permanence; neither can be quietly edited later.

The seven roles

The night handles volume. Judgment — choosing what to build, deciding what the evidence means, admitting models — runs through operator sessions staffed by seven AI roles, each owning a concrete artifact:

Orchestrator — runs each session: verifies the scorer’s integrity, reads the board and the queue, and chooses one move.
Data Engineer — fetches real datasets and wires them into features, working outcome-blind: it never sees what happened while coding what was true beforehand.
ML / Forecasting Engineer — builds and modifies variants and grooms the hypothesis queue.
Demographer — owns population-structure features: youth bulges, urbanization, age pyramids.
Red-Team Forecaster — writes the strongest case against any prediction before it is allowed to be registered.
Calibration Auditor — reads the leaderboard and the nightly logs for drift, inconsistency, and signs of self-deception.
Philosopher of Science — judges. It owns the falsification register and the admission gate, and nothing else.

The structure is a separation of powers. Proposers never grade: any role that builds or fits models is forbidden from reading the sealed test data, and none of them may touch the scoring function. Graders never propose: the Philosopher judges work but never selects it, because a judge who chooses the projects acquires a stake in their success.

There is one more discipline, easy to miss: every worker is cold-briefed. It receives one bounded task, the files it needs, the integrity rules, and the exact format of its answer — and nothing else. It does not inherit the session’s running context. This is deliberate hygiene, like an examiner who has never met the student: a worker that inherits the orchestrator’s enthusiasm for a hypothesis inherits its biases too.

The Philosopher gate

The last checkpoint before anything becomes official is adversarial by construction. The Philosopher of Science — run on the most capable model available, because judgment is the one place capability is not optional — evaluates every change that matters to the leaderboard against the falsification register: a file of pre-registered criteria, each one written down before the evidence it governs arrives.

The criteria are locked. A register entry may later be marked PASSED or FAILED, but its thresholds may never be weakened after the fact. There is no “the test was too strict” and no quiet redefinition of success — the goalposts are bolted down in advance, in writing, in a file that only grows.

And the Philosopher only judges. It never proposes variants, never selects the night’s work, never decides what to build next. Its verdict does exactly three things: it admits a variant to the official ensemble; or it records a falsification — permanently, as it did for the fixed-prior PITF mechanism in June 2026; or it leaves the variant experimental, awaiting evidence that does not yet exist.

The machine generates candidates and evidence at scale. Nothing becomes official without surviving an adversary whose rules were written first.

That asymmetry is the whole design. Throughput is allowed to be autonomous, fast, and cheap, because the gate is none of those things — it is slow, adversarial, and bound by rules it cannot rewrite. The next chapter explains the grading itself: what a probability forecast is, and how you score one honestly.

What to remember

✓“The formula” is a population of competing, runnable models — each one a program plus a written hypothesis, cheap to falsify individually and scored together on one board.
✓Admitted and experimental are different worlds: only admitted models speak for the project, and leaderboard rank alone never admits anything.
✓Every night an autonomous pipeline tunes parameters, mass-tests structures against luck, and builds one new candidate — and everything it produces wakes up experimental.
✓The single nightly AI build is deliberately caged: a hard cost cap, a timeout, tamper detection on protected files, and a kill switch.
✓Powers are separated: proposers never grade, graders never propose, and the Philosopher judges against falsification criteria that were locked before the evidence arrived.