A Second Foundation
Learn · Chapter 07 / 07

Glossary

Every recurring term on this site, defined in plain language, with links to the chapter that explains it in depth.

A

Ablation

Removing one piece of a model — usually a group of input features — and re-scoring it. If the score barely changes without the piece, the piece was not actually doing anything. The project uses pre-registered ablation tests to check whether a feature genuinely adds discrimination or just looks decorative.

Explained in depth: Chapter 04 The Falsification Register

Admitted

The status of a model variant that has passed its pre-registered severe test and the Philosopher gate, earning a place in the official ensemble. Contrast with experimental.

Explained in depth: Chapter 02 Admitted vs experimental

B

Base rate

How often events of a given type happen historically, before you know anything specific about the case at hand. If roughly 4% of autocracies experience a regime collapse in any given year, that 4% is the base rate — the starting point every honest forecast must justify departing from.

Explained in depth: Chapter 06 Base rates

Benjamini–Hochberg FDR

A statistical correction used when testing many hypotheses at once. If you test 236 candidate model structures, a handful will look good purely by luck. The Benjamini–Hochberg procedure controls the false discovery rate: of the candidates it lets through, only a chosen fraction (say 10%) are expected to be flukes. Survivors are labeled hypotheses, never discoveries.

Explained in depth: Chapter 04 Mass search, honestly
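
As a rough illustration of the Benjamini–Hochberg step-up rule described in this entry, the sketch below sorts p-values, finds the largest rank whose p-value sits at or below (rank / m) × FDR, and lets everything up to that rank through. The function name, the ten p-values, and the 10% level are illustrative assumptions, not the project's actual code or data.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return the indices of hypotheses that survive Benjamini-Hochberg at level `fdr`."""
    m = len(p_values)
    # Sort hypothesis indices from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * fdr:
            cutoff_rank = rank
    # Step-up rule: every hypothesis ranked at or before the cutoff survives.
    return sorted(order[:cutoff_rank])


# Made-up p-values for ten hypothetical candidates.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]
survivors = benjamini_hochberg(p_values, fdr=0.10)
print(survivors)  # [0, 1, 2, 3, 4]
```

Note that candidates 2 and 3 miss their own rank thresholds but still pass, because candidate 4 clears its threshold and the rule admits everything ranked before it. That is the step-up behavior, and it is one more reason survivors count only as hypotheses.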

Bootstrap confidence interval

An uncertainty range computed by re-scoring a model thousands of times on random re-samples of the test events. If two models' intervals overlap heavily, the data cannot tell them apart — whatever the raw ranking says.

Explained in depth: Chapter 05 Overlapping uncertainty
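
A minimal sketch of a percentile bootstrap interval for a Brier score, assuming forecasts and outcomes are stored as parallel lists. The helper names, the number of resamples, and the toy data are assumptions for illustration; the project's own scorer may compute its intervals differently.

```python
import random


def brier(forecasts, outcomes):
    """Mean squared gap between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)


def bootstrap_ci(forecasts, outcomes, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the Brier score."""
    rng = random.Random(seed)
    n = len(outcomes)
    scores = []
    for _ in range(n_resamples):
        # Resample test events with replacement and re-score.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(brier([forecasts[i] for i in idx], [outcomes[i] for i in idx]))
    scores.sort()
    lo = scores[int((1 - level) / 2 * n_resamples)]
    hi = scores[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi


def intervals_overlap(a, b):
    """True when two (lo, hi) intervals share any common range."""
    return a[0] <= b[1] and b[0] <= a[1]


# Made-up outcomes and two hypothetical models.
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]
model_a = [0.7, 0.2, 0.1, 0.6, 0.3, 0.2, 0.1, 0.5, 0.2, 0.3, 0.6, 0.1]
model_b = [0.6, 0.3, 0.2, 0.7, 0.2, 0.1, 0.2, 0.6, 0.3, 0.2, 0.5, 0.2]
ci_a = bootstrap_ci(model_a, outcomes)
ci_b = bootstrap_ci(model_b, outcomes, seed=1)
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

With only a dozen events the intervals are wide, which is the point of the entry: a small gap in raw scores says little when the ranges overlap.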

Brier score

The main grade for a probability forecast: the squared gap between what you predicted (0 to 1) and what happened (0 or 1), averaged over all events. Lower is better. Always guessing 50% scores 0.25 — that is what chance looks like. Roughly 0.18 matches prediction markets; below 0.15 approaches superforecaster level.

Explained in depth: Chapter 03 The Brier score
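
The Brier score definition can be sketched in a few lines. The function name and the five toy events are assumptions made for illustration only.

```python
def brier_score(forecasts, outcomes):
    """Mean squared gap between forecast probabilities (0..1) and outcomes (0 or 1)."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two equal-length, non-empty sequences")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


outcomes = [1, 0, 0, 1, 0]             # made-up results: 1 = event happened
coin_flip = [0.5] * len(outcomes)      # always guessing 50%
sharper = [0.8, 0.1, 0.2, 0.7, 0.1]    # a hypothetical forecaster with some skill

print(brier_score(coin_flip, outcomes))  # 0.25, the chance baseline
print(round(brier_score(sharper, outcomes), 3))  # 0.038
```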

C

Calibration

Whether your stated probabilities match reality on average: of all the times you said “30%”, did the event happen about 30% of the time? A model can be perfectly calibrated and still useless — always predicting the base rate is calibrated but tells you nothing about which case is the risky one.

Explained in depth: Chapter 03 Calibration vs resolution
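
One common way to check calibration is to group forecasts into probability bins and compare each bin's average forecast with how often the event actually happened. The sketch below assumes ten equal-width bins and made-up data; the bin count and function name are illustrative choices.

```python
from collections import defaultdict


def calibration_table(forecasts, outcomes, n_bins=10):
    """Return rows of (mean forecast, observed frequency, count) per occupied bin."""
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        # Clamp so that a forecast of exactly 1.0 lands in the top bin.
        b = min(int(p * n_bins), n_bins - 1)
        bins[b].append((p, y))
    rows = []
    for b in sorted(bins):
        pairs = bins[b]
        mean_forecast = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        rows.append((mean_forecast, observed, len(pairs)))
    return rows


# Ten made-up cases, all forecast at 30%, of which three happened.
forecasts = [0.3] * 10
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
for mean_forecast, observed, count in calibration_table(forecasts, outcomes):
    print(f"said {mean_forecast:.2f}, happened {observed:.2f}, n={count}")
```

This toy forecaster is well calibrated yet gives every case the same number, which is exactly the "calibrated but useless" situation the entry warns about.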

Confound

A hidden variable that explains an apparent relationship. Example from this project: a feature set can seem to improve predictions simply because its data coverage is better in the modern era — the “signal” is really just “which decade is this”. Severe tests are designed to catch confounds before a mechanism gets credit.

Explained in depth: Chapter 04 The Falsification Register

E

Embargo

In time-series testing, a buffer of years excluded from training immediately after the test window. Without it, a model can cheat by learning from the aftermath of the very events it is being tested on.

Explained in depth: Chapter 04 Purged cross-validation

Ensemble

The official prediction model: the combination of all admitted variants. Only variants that survive a pre-registered severe test and the Philosopher gate join it. Nightly experimental builds never enter automatically.

Explained in depth: Chapter 02 Admitted vs experimental

EVT deflation

A haircut applied to the best score on the leaderboard to account for luck. If you try N models, the best one will beat chance by roughly √(2·ln N) standard deviations of pure noise even if none of them has any skill. Subtracting that expected lucky gap turns the raw best score into an honest one. It is why the board can show a raw best Brier well below chance while the deflated headline still reads “no evidence of skill”.

Explained in depth: Chapter 04 The luck of the draw
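
The arithmetic behind the deflation can be sketched as follows. The √(2·ln N) factor comes from the entry itself; the noise standard deviation, the raw score, and the function names are assumptions chosen for illustration, and the project's frozen scorer may estimate the lucky gap differently.

```python
import math


def expected_lucky_gap(n_models, noise_sd):
    """Rough best-of-N advantage under pure noise: sqrt(2 * ln N) * noise_sd."""
    if n_models < 1:
        raise ValueError("n_models must be at least 1")
    return math.sqrt(2 * math.log(n_models)) * noise_sd


def deflated_brier(raw_best, n_models, noise_sd):
    """Lower Brier is better, so deflation adds the expected lucky gap back on."""
    return raw_best + expected_lucky_gap(n_models, noise_sd)


n_models = 236      # illustrative count of candidates tried
noise_sd = 0.02     # assumed spread of Brier scores under pure noise
raw_best = 0.20     # hypothetical raw best score, below the 0.25 chance line

gap = expected_lucky_gap(n_models, noise_sd)
print(round(gap, 3))                                           # about 0.066
print(round(deflated_brier(raw_best, n_models, noise_sd), 3))  # about 0.266
```

With these made-up numbers the raw best score beats chance, but the deflated score lands above 0.25, so the honest headline stays at "no evidence of skill".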

Experimental

The status of a model variant that runs on the leaderboard but has not passed admission: it either awaits its severe test or has failed one. Experimental variants are excluded from the official ensemble no matter how good their raw scores look.

Explained in depth: Chapter 02 Admitted vs experimental

F

FALSIFIED

The status of a causal mechanism that failed its locked severe test. A falsified mechanism is recorded permanently in the Falsification Register and cannot re-enter through the same door — re-entry requires a materially different mechanism. Falsification is treated as progress: the project now knows something it did not.

Explained in depth: Chapter 04 The Falsification Register

Feature

An input a model reads for each country and year — infant mortality, regime type, elite factionalism, neighboring conflict, and so on. Every feature value traces to a real downloaded dataset row; unknowns stay empty rather than being invented.

Explained in depth: Chapter 02 The model zoo

FETCHED-UNWIRED

The status of a mechanism whose data has been downloaded and prepared but which no model variant consumes yet. It marks work waiting to happen, not evidence of anything.

Explained in depth: Chapter 04 The Falsification Register

Frozen scorer

The single program allowed to produce official numbers, locked with a cryptographic fingerprint (SHA-256). No agent may edit it; anyone can verify it has not changed. It exists because the most common failure of autonomous research systems is quietly redefining the test they are graded on.

Explained in depth: Chapter 04 The frozen scorer
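
Verifying a SHA-256 fingerprint takes only the standard library. The sketch below uses a throwaway file as a stand-in for the scorer; the file name and its contents are hypothetical, and how the project stores its locked fingerprint is not specified here.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Hex SHA-256 digest of a file, read in chunks to keep memory use small."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, locked_fingerprint):
    """True when the file still matches the fingerprint recorded at lock time."""
    return sha256_of_file(path) == locked_fingerprint


with tempfile.TemporaryDirectory() as tmp:
    scorer_path = os.path.join(tmp, "scorer.py")  # hypothetical file name
    with open(scorer_path, "w") as fh:
        fh.write("def score(p, y):\n    return (p - y) ** 2\n")
    locked = sha256_of_file(scorer_path)          # fingerprint taken at lock time
    ok_before = is_unchanged(scorer_path, locked)

    with open(scorer_path, "a") as fh:            # simulate a quiet edit
        fh.write("# tweak\n")
    ok_after = is_unchanged(scorer_path, locked)

print(ok_before, ok_after)  # True False
```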

G

Grammar sweep

The nightly mass search over thousands of candidate model structures — combinations of feature subsets, transforms, model bases, and link functions. Every candidate faces a permutation null and FDR correction, and survivors are queued as hypotheses for human-gated review, never auto-admitted.

Explained in depth: Chapter 02 The nightly loop

H

HARKing

“Hypothesizing After the Results are Known” — inventing the explanation after peeking at the answer, then presenting it as if it were predicted in advance. The entire integrity apparatus (sealed hold-outs, pre-registration, locked criteria) exists to make HARKing structurally difficult.

Explained in depth: Chapter 03 The trap we fell into

Hazard

The probability that an event — a coup, a collapse, the outbreak of war — occurs in a given year for a given country. Most models in the zoo work by estimating a hazard and adjusting it up or down based on features.

Explained in depth: Chapter 03 Forecasts as probabilities

Hold-out

A sealed set of test cases that model-building agents are forbidden to read. The current hold-outs: 26 curated historical events, and 1,198 country-years covering 2010–2015, sealed with a cryptographic fingerprint before any model fit them. Reading the hold-out to inform model design is the cardinal sin — it voids the project's claims.

Explained in depth: Chapter 04 Sealed hold-outs

L

LIVE-UNTESTED

The status of a mechanism that is wired into a running model but has not yet faced its severe test. It produces numbers, but the project does not yet claim those numbers mean anything.

Explained in depth: Chapter 04 The Falsification Register

Log-loss

A companion score to the Brier score that punishes confident wrong answers much more brutally: predicting 99% on an event that does not happen is catastrophic under log-loss. Useful as a cross-check because models can look acceptable on one score and terrible on the other.

Explained in depth: Chapter 03 The Brier score
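
To see how much harder log-loss punishes a confident miss, the sketch below scores the entry's own example (99% on an event that does not happen) under both rules. The function names and the clipping constant are illustrative assumptions.

```python
import math


def log_loss(forecasts, outcomes, eps=1e-15):
    """Mean negative log-likelihood of the outcomes under the forecasts."""
    total = 0.0
    for p, y in zip(forecasts, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(forecasts)


def brier(forecasts, outcomes):
    """Mean squared gap between forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


# One event that did not happen, scored for a confident and a hedged forecast.
confident_ll = log_loss([0.99], [0])   # -ln(0.01), about 4.61
hedged_ll = log_loss([0.50], [0])      # ln(2), about 0.69
confident_bs = brier([0.99], [0])      # about 0.98
hedged_bs = brier([0.50], [0])         # 0.25

print(round(confident_ll / hedged_ll, 1))  # log-loss penalty ratio, about 6.6
print(round(confident_bs / hedged_bs, 1))  # Brier penalty ratio, about 3.9
```

The Brier penalty for a single miss is capped at 1, while log-loss grows without bound as the forecast approaches certainty.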

M

Model zoo

The population of competing, runnable forecasting models that replaced the single grand formula in June 2026. Each variant encodes one structural hypothesis about how crises happen; they compete on a frozen leaderboard, and only gated winners join the official ensemble.

Explained in depth: Chapter 02 The model zoo

N

Negative control

A test case chosen because the event did not happen — a society under severe stress that nevertheless held together. A model with real signal should push probabilities down on these. If a change improves the headline score while making negative controls worse, that is the signature of a statistical artifact, not understanding.

Explained in depth: Chapter 03 Negative controls

P

PBO (Probability of Backtest Overfitting)

An estimate of the chance that the leaderboard's best performer is best by luck rather than skill. It is computed by repeatedly splitting the data, picking the winner on one half, and checking whether it stays the winner on the other. A PBO above 0.5 means that, more likely than not, the apparent winner would not repeat its win on fresh data.

Explained in depth: Chapter 05 Overlapping uncertainty
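
The split-and-recheck idea can be sketched as below. This follows the simplified description in this entry (does the winner on one half also win on the other?) and uses random halves; it is not the full combinatorially symmetric cross-validation procedure, and it ignores time ordering. The loss matrices and function names are made up for illustration.

```python
import random


def mean_loss(losses, event_ids):
    """Average loss of one model over the selected events."""
    return sum(losses[e] for e in event_ids) / len(event_ids)


def pbo_estimate(loss_matrix, n_splits=500, seed=0):
    """Share of random half-splits where the winner on one half loses on the other.

    loss_matrix[m][e] is the loss of model m on event e (lower is better).
    """
    rng = random.Random(seed)
    n_models = len(loss_matrix)
    events = list(range(len(loss_matrix[0])))
    half = len(events) // 2
    failures = 0
    for _ in range(n_splits):
        rng.shuffle(events)
        half_a, half_b = events[:half], events[half:]
        winner_a = min(range(n_models), key=lambda m: mean_loss(loss_matrix[m], half_a))
        winner_b = min(range(n_models), key=lambda m: mean_loss(loss_matrix[m], half_b))
        if winner_a != winner_b:
            failures += 1
    return failures / n_splits


# A model that is genuinely better on every event never loses its title.
dominant = [[0.01] * 10, [0.20] * 10, [0.30] * 10]
print(pbo_estimate(dominant))  # 0.0

# Three models whose losses are pure noise: the "winner" rarely repeats.
noise_rng = random.Random(42)
noisy = [[noise_rng.random() for _ in range(10)] for _ in range(3)]
print(pbo_estimate(noisy))
```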

Permutation null

A way to measure what pure luck looks like: keep the model's predictions fixed, shuffle the real outcomes at random many times, and record the scores. The spread of those shuffled scores is the luck bandwidth — any claimed skill has to clear it.

Explained in depth: Chapter 04 The luck of the draw
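
A minimal sketch of the shuffle test described in this entry: predictions stay fixed, outcomes are shuffled, and the real score is compared with the shuffled scores. The function names, the number of shuffles, and the toy data are assumptions for illustration.

```python
import random


def brier(forecasts, outcomes):
    """Mean squared gap between forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)


def permutation_null(forecasts, outcomes, n_shuffles=2000, seed=0):
    """Brier scores obtained after randomly shuffling the real outcomes."""
    rng = random.Random(seed)
    shuffled = list(outcomes)  # copy so the caller's list is left untouched
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null_scores.append(brier(forecasts, shuffled))
    return null_scores


def permutation_p_value(observed, null_scores):
    """Share of shuffles scoring at least as well as the real ordering (lower is better)."""
    as_good = sum(1 for s in null_scores if s <= observed)
    return (as_good + 1) / (len(null_scores) + 1)  # +1 keeps the estimate above zero


# A hypothetical forecaster that separates events from non-events cleanly.
forecasts = [0.9] * 10 + [0.1] * 10
outcomes = [1] * 10 + [0] * 10
observed = brier(forecasts, outcomes)
null_scores = permutation_null(forecasts, outcomes)
print(round(observed, 3), min(null_scores), max(null_scores))
print(permutation_p_value(observed, null_scores))
```

The range between the smallest and largest shuffled scores is the luck bandwidth; here the real score sits far below it, so the small p-value reflects genuine separation in the toy data.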

Philosopher gate

The adversarial admission step: an AI agent acting as a philosopher of science judges every leaderboard-relevant change against locked falsification criteria. It can mark criteria PASSED or FAILED but can never weaken them, and it judges only — it never selects what gets built.

Explained in depth: Chapter 02 The Philosopher gate

Pre-registration

Committing to a prediction or a test criterion before the outcome can be known, in a form that cannot be quietly edited later (hash-locked, append-only). It is the difference between calling the shot and explaining the shot afterwards.

Explained in depth: Chapter 06 Locked before the outcome
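
One way to make a record "hash-locked, append-only" is a hash chain, where each entry's fingerprint covers the previous entry's fingerprint. The sketch below is an assumption about how such a log could work, not a description of the project's actual registry; the field names and example texts are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, text):
    """SHA-256 over the previous hash plus the entry text."""
    payload = json.dumps({"prev": prev_hash, "text": text}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log, text):
    """Return a new log with one more entry chained to the last one."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "text": text, "hash": entry_hash(prev_hash, text)}
    return log + [entry]


def verify(log):
    """True when no entry has been edited, reordered, or removed from the middle."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != entry_hash(prev_hash, entry["text"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
log = append_entry(log, "Criterion: ablation must cost at least X on training data")
log = append_entry(log, "Prediction: event E has probability 0.12 by year-end")
print(verify(log))  # True

# Quietly rewriting an earlier entry breaks the chain.
tampered = [dict(e) for e in log]
tampered[0]["text"] = "Criterion: ablation must cost at least X/2"
print(verify(tampered))  # False
```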

Proposer-forbidden

The separation-of-powers rule: any agent that proposes or fits models is forbidden from reading the hold-out data. The proposer suggests; the frozen scorer judges; nobody grades their own homework.

Explained in depth: Chapter 04 Sealed hold-outs

Purged cross-validation

Cross-validation adapted for historical data. Standard k-fold testing leaks information across time — training data from after the test events tells the model how the story ends. Purging removes training cases that overlap the test window; an embargo removes the years right after it. Standard k-fold is banned in this project.

Explained in depth: Chapter 04 Purged cross-validation
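
The purge and embargo rules can be sketched for a single fold as follows. The sketch assumes each case is identified by its year and that a case's label looks `label_horizon` years ahead; both parameters, the function name, and the example window are illustrative assumptions rather than the project's settings.

```python
def purged_split(years, test_start, test_end, label_horizon=0, embargo=0):
    """Return (train_indices, test_indices) for one fold.

    - test:    cases whose year falls inside [test_start, test_end]
    - purge:   earlier cases whose label window reaches into the test window
    - embargo: cases in the `embargo` years immediately after the test window
    """
    train, test = [], []
    for i, year in enumerate(years):
        if test_start <= year <= test_end:
            test.append(i)
        elif year < test_start and year + label_horizon >= test_start:
            continue  # purged: its label overlaps the test window
        elif test_end < year <= test_end + embargo:
            continue  # embargoed: too close to the aftermath of the test events
        else:
            train.append(i)
    return train, test


years = list(range(2000, 2021))
train, test = purged_split(years, 2010, 2012, label_horizon=1, embargo=2)
print([years[i] for i in test])   # [2010, 2011, 2012]
print([years[i] for i in train])  # 2000-2008 and 2015-2020; 2009, 2013, 2014 dropped
```

A plain k-fold split would keep 2009, 2013, and 2014 in training, which is the leak the entry describes.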

R

Ratchet

The nightly parameter-tuning stage: a search algorithm (differential evolution) tries hundreds of parameter settings per model on training data only, keeping a change only if it strictly improves the purged cross-validation score. Like a mechanical ratchet, it can move forward but is built not to slip back — and it never touches the hold-out.

Explained in depth: Chapter 02 The nightly loop
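
The "keep only strict improvements" rule is easy to sketch. To stay short, the example below swaps differential evolution for simple random perturbation and uses a toy quadratic as a stand-in for the purged cross-validation score; both are loudly labeled assumptions, as are the function and parameter names.

```python
import random


def ratchet(score_fn, start_params, n_trials=300, step=0.1, seed=0):
    """Accept a candidate only if it strictly improves the score (lower is better)."""
    rng = random.Random(seed)
    best_params = list(start_params)
    best_score = score_fn(best_params)
    history = [best_score]
    for _ in range(n_trials):
        # Random perturbation stands in for a differential-evolution proposal.
        candidate = [x + rng.gauss(0, step) for x in best_params]
        score = score_fn(candidate)
        if score < best_score:  # strict improvement, otherwise the ratchet holds
            best_params, best_score = candidate, score
        history.append(best_score)
    return best_params, best_score, history


def toy_cv_score(params):
    """Hypothetical training-only score with its minimum at (3, -1)."""
    x, y = params
    return (x - 3) ** 2 + (y + 1) ** 2


best_params, best_score, history = ratchet(toy_cv_score, [0.0, 0.0])
print(history[0], best_score)
```

Because rejected candidates leave the state untouched, the recorded score can only stay flat or fall from one trial to the next.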

Reference class

The group of historical cases a new case is compared against — for example, “autocracies with factional elites”. The base rate of that group is the honest starting forecast before any model-specific reasoning.

Explained in depth: Chapter 06 Base rates

Reflexivity

When publishing a forecast changes the thing being forecast. Every registered prediction is labeled by class: immune (the forecast cannot move the outcome), weakly self-fulfilling (belief in it nudges the outcome closer), or self-defeating (the warning triggers prevention). Reflexivity is why market performance is a secondary signal, not the main validation.

Explained in depth: Chapter 06 Reflexivity

Resolution

The part of forecast skill that measures discrimination: does the model assign meaningfully higher probabilities to the cases where the event happened than to the cases where it did not? Higher is better. A model can improve its Brier score with zero resolution by hedging toward the base rate — which is why this project never accepts Brier improvements alone as evidence.

Explained in depth: Chapter 03 Calibration vs resolution
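
The claim that hedging toward the base rate improves the Brier score without adding any discrimination can be checked on toy data. The sketch below uses a simple gap between the mean forecast on events and on non-events as a stand-in for resolution; that measure, the function names, and the data are illustrative assumptions, not the project's official decomposition.

```python
def brier(forecasts, outcomes):
    """Mean squared gap between forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)


def discrimination_gap(forecasts, outcomes):
    """Mean forecast where the event happened minus mean forecast where it did not."""
    on_events = [p for p, y in zip(forecasts, outcomes) if y == 1]
    on_non_events = [p for p, y in zip(forecasts, outcomes) if y == 0]
    return sum(on_events) / len(on_events) - sum(on_non_events) / len(on_non_events)


# Twenty made-up cases with two events, so the base rate is 10%.
outcomes = [1, 1] + [0] * 18
coin_flip = [0.5] * 20   # always 50%
hedged = [0.1] * 20      # always the base rate

print(brier(coin_flip, outcomes))                   # 0.25
print(round(brier(hedged, outcomes), 3))            # 0.09
print(abs(discrimination_gap(coin_flip, outcomes)) < 1e-12)  # True
print(abs(discrimination_gap(hedged, outcomes)) < 1e-12)     # True
```

The hedged forecaster's Brier score drops from 0.25 to 0.09, yet it still cannot say which two cases were the risky ones.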

S

Severe test

A test a mechanism would probably fail if it were not real — with the pass/fail criterion locked before the score is read. “The model fits history” is not severe; “removing this feature must cost at least this much discrimination on training data, and negative controls must not get worse” is. Mechanisms graduate only through severe tests.

Explained in depth: Chapter 04 The Falsification Register

Structural phenomenon

A collective, large-scale event — regime transition, war initiation or termination, institutional collapse, economic phase transition, mass collective action. The project predicts only these. Individual decisions, leader tenure, and unanchored asset prices are permanently out of scope.

Explained in depth: Chapter 01 What it tries to predict

T

Tier

The capability ladder for models. Tier 0: cannot emit a real probability. Tier 1: beats chance (Brier below 0.25). Tier 2: beats market consensus (below ~0.18). Tier 3: approaches superforecaster level (below ~0.15). Tiers are claimed only on deflated, hold-out evidence — not raw scores.

Explained in depth: Chapter 05 Column by column

V

Variant

One model in the zoo: a runnable program plus a written structural hypothesis about the world, its required features, and its admission status. Variants are cheap and disposable by design — the insights they generate outlive them.

Explained in depth: Chapter 02 The model zoo