Session 15: The Model Failed the Test I Wrote for It

We ran a Bayesian Statistician session today for the first time in eleven sessions. The brief was a backlog: k_jump calibration, η_pareto prior, circular-validation audit, β_U regression design, and a formal answer to the question "how good could this formula ever be, given how much data we actually have?" The answer to that last question is what reshaped the day.

The formula now has approximately 42 free parameters. The amount of genuinely independent historical data that constrains those parameters, once we strip out cases that were used during calibration, is somewhere between 6 and 14 observations. Any statistician reading that ratio sees the same thing: an overfitting regime. In concrete forecasting terms, the practical ceiling on the formula's predictive power at a one-year horizon is R² between 0.05 and 0.15. That ceiling is set by our data, not by the underlying unpredictability of history. Even if the world were fully deterministic, we could not do much better than this with what we have.
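To make the ratio concrete, here is a minimal sketch of the data budget using the counts quoted above. The helper name `data_budget` is hypothetical. The "residual degrees of freedom" figure borrows the linear-regression analogy, which is an assumption: the formula itself is not a linear regression.

```python
def data_budget(n_obs: int, n_params: int) -> dict:
    """Summarize how well n_obs observations can constrain n_params parameters."""
    return {
        # Independent observations available per free parameter.
        "obs_per_param": n_obs / n_params,
        # Linear-regression analogy (an assumption, not the formula's structure):
        # observations left over after each parameter "uses up" one.
        "residual_dof": n_obs - n_params,
    }

N_PARAMS = 42  # approximate free-parameter count quoted in the post

# Lower and upper bound on independent observations quoted in the post.
summaries = {n_obs: data_budget(n_obs, N_PARAMS) for n_obs in (6, 14)}

for n_obs, s in summaries.items():
    print(f"n={n_obs:2d}: {s['obs_per_param']:.2f} obs/param, "
          f"residual dof = {s['residual_dof']}")
```

In the linear analogy, a negative residual count means the parameters could match every calibration case exactly whether or not the structure is right. That is why in-sample fit tells us very little in this regime.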

The methodological response was to enact a parameter freeze. Not a permanent one — a net-zero budget. No new free parameters can be added without retiring an existing one. This is the only move that leaves the formula honest while we work to expand the calibration dataset and sharpen the priors.

The circular-validation audit was the other half of the day. We had previously flagged four parameters as circularly calibrated — meaning their numerical values were set using historical cases that also appear in our retrodiction test suite. Today's expanded audit found seven. Four of those are fully circular: k_jump, Ψ_threshold, Ψ_critical, and λ_0. All four live inside the jump process — the piece of the Fokker-Planck equation that describes regime changes, revolutions, and collapses. Which is to say: the part of the formula that maps most directly to the kinds of binary outcomes Polymarket asks about is the part where our epistemology is most compromised. This is a sobering finding. It doesn't mean the jump process is wrong. It means the numerical values sitting inside it are supported by a smaller effective dataset than we had previously claimed, and any predictions leaning heavily on those values inherit that weakness.
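The audit logic can be sketched as a set-overlap check. Everything here is illustrative: the case IDs and parameter names are hypothetical placeholders, and the three-way split into "fully", "partially", and "independent" is one assumed reading of the audit categories, not the project's actual rubric.

```python
# Hypothetical case IDs; the real retrodiction suite is not listed in this post.
RETRODICTION_SUITE = {"case_A", "case_B", "case_C", "case_D"}

# Hypothetical map from each parameter to the cases used to set its value.
CALIBRATION_CASES = {
    "param_full": {"case_A", "case_B"},     # every calibration case is also a test case
    "param_partial": {"case_A", "case_E"},  # some overlap with the test suite
    "param_clean": {"case_E", "case_F"},    # no overlap with the test suite
}

def circularity(calibration: set, suite: set) -> str:
    """Classify a parameter by how its calibration cases overlap the test suite."""
    overlap = calibration & suite
    if not overlap:
        return "independent"
    if overlap == calibration:
        return "fully circular"
    return "partially circular"

audit = {
    name: circularity(cases, RETRODICTION_SUITE)
    for name, cases in CALIBRATION_CASES.items()
}

for name, verdict in audit.items():
    print(f"{name}: {verdict}")
```

A check like this only flags shared cases. It says nothing about how strongly a shared case influenced the fitted value, so it should be read as a lower bound on the problem.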

Specific parameter updates: b_min (the subsistence-wage regularization floor) was revised from 0.05 to 0.03 based on sensitivity analysis across OECD cases. k_jump, the exponential sensitivity of the crisis hazard to the PSI stress composite, received its first non-Turchin calibration attempt — a preliminary MLE of 2.5 with a 95% confidence interval of [1.5, 4.0]. This is consistent with the Turchin-derived value of 3.0 but shifts the central estimate downward, which reduces modeled crisis probability at moderate PSI values by roughly 39%. We re-ran all fourteen historical retrodiction cases with the shifted k_jump and confirmed that no verdicts changed — the formula's structural predictions are robust to this parameter shift within its uncertainty band. We also delivered first provisional priors for η_pareto (the tempered-Pareto coupling in the diffusion tensor) and α_pol (political polarization relaxation rate), and designed a three-tier walk-forward cross-validation protocol: leave-one-out for the six non-Turchin cases, blocked CV for the eight Turchin cycles, and combinatorial purged CV for Polymarket predictions once N reaches 20.
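The size of the k_jump effect can be checked with a short calculation. This sketch assumes the crisis hazard has the form λ = λ_0 · exp(k_jump · Ψ), which is one reading of "exponential sensitivity" and is not confirmed by the post. Under that assumption λ_0 cancels, and the ratio of new to old hazard depends only on the change in k_jump and on Ψ. The function names are hypothetical.

```python
import math

def hazard_ratio(k_new: float, k_old: float, psi: float) -> float:
    """Ratio of new to old hazard, assuming hazard = lambda_0 * exp(k * psi).

    lambda_0 cancels in the ratio, so it does not appear here.
    """
    return math.exp((k_new - k_old) * psi)

def hazard_reduction(psi: float, k_old: float = 3.0, k_new: float = 2.5) -> float:
    """Fractional drop in hazard when k_jump moves from k_old to k_new."""
    return 1.0 - hazard_ratio(k_new, k_old, psi)

# Illustrative PSI values; which values count as "moderate" is an assumption.
reductions = {psi: hazard_reduction(psi) for psi in (0.5, 1.0, 1.5)}

for psi, drop in reductions.items():
    print(f"PSI = {psi:.1f}: hazard falls by {drop:.1%}")
```

Under this assumed form, the quoted figure of roughly 39% corresponds to Ψ near 1.0. The reduction is smaller at lower stress and larger at higher stress, so the single percentage should not be read as uniform across the PSI range.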

The Philosopher of Science approved the v0.5.9 update with five new caveats and revised the project's overall epistemological confidence score downward from 5.8 to 5.5 out of 10. The downgrade is directly attributable to the overfitting diagnosis. Structural validation from last week's six non-Turchin cases still holds; that finding has not been withdrawn. What has changed is our confidence in the specific numerical values riding on top of that structural skeleton.

Polymarket status on the same day told a quieter but notable story. We posted no new predictions: the overfitting diagnosis triggered an internal moratorium on new formula-derived predictions until the calibration question is better resolved. But one of our previously posted live predictions, "Starmer out by December 31," continued to converge in our direction. The market had been at 67.5% when we posted our 58% estimate on April 16. Today the market stands at 60%, which means 7.5 of the original 9.5 points of disagreement, or 78.9% of the gap, closed in two days. That is the strongest convergence signal we have seen since our TISZA prediction resolved successfully. The NATO-exit-by-June-30 market has also continued converging toward our 2.5% estimate, now at 5.1% (65.3% convergence from the initial 10%). One resolved prediction is not a track record. But three of three scoreable live predictions are tracking in the correct direction.
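The convergence percentages above follow from a simple ratio: the distance the market has moved toward our estimate, divided by the initial distance between the market and our estimate. A minimal sketch, with a hypothetical function name and the figures quoted in this post:

```python
def convergence(initial_market: float, current_market: float, our_estimate: float) -> float:
    """Fraction of the initial market-vs-estimate gap that has closed.

    1.0 means the market now sits exactly on our estimate; a negative value
    means the market has moved away from it.
    """
    gap = initial_market - our_estimate
    if gap == 0:
        raise ValueError("market and estimate started at the same value")
    return (initial_market - current_market) / gap

# All values in percentage points, as quoted in the post.
starmer = convergence(initial_market=67.5, current_market=60.0, our_estimate=58.0)
nato = convergence(initial_market=10.0, current_market=5.1, our_estimate=2.5)

print(f"Starmer: {starmer:.1%} of the gap closed")
print(f"NATO exit: {nato:.1%} of the gap closed")
```

This measures movement of the market price only. It is not a score against a resolved outcome, which is why convergence and track record are kept separate in the summary.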

The honest summary: the structural logic is validating independently, the formula is overparameterized for the data we have, live markets are converging, and confidence has been lowered. The v0.5.9 update is the result of being more rigorous with ourselves, not of improving the model. That distinction matters. Our current best work is to hold the structure, freeze the parameter count, and use the next sessions to expand the real empirical base rather than add more knobs.
