Engine Evaluation Stability Analysis

Comparing evaluation consistency from 250k to 500k nodes

Generated: 2026-01-25 | Dataset: 6,488 positions across 100 games

1. Raw Signal Stability Ranking

Composite score from four metrics (Spearman ρ, R², systematicity, fraction stable):

Rank Engine Composite Score
1 Theoria16 0.979
2 Theoria17-3072 0.510
3 Theoria17-1024 0.272
4 Stockfish 17 0.098
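
As a rough illustration of how three of the four metrics could be computed for one engine, here is a minimal sketch. It assumes each engine has two aligned lists of evaluations in pawns (one at 250k nodes, one at 500k nodes), that R² is the squared Pearson correlation of a simple linear fit, and that "stable" means the evaluation moved by less than 0.3 pawns. None of these definitions are stated in the report. The function names and the threshold are hypothetical. "Systematicity" and the weighting used for the composite score are not defined here, so they are left out.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def stability_metrics(evals_250k, evals_500k, stable_threshold=0.3):
    """Per-engine metrics; stable_threshold (pawns) is an assumed value."""
    stable = sum(
        abs(a - b) < stable_threshold for a, b in zip(evals_250k, evals_500k)
    )
    return {
        "spearman": spearman(evals_250k, evals_500k),
        "r2": pearson(evals_250k, evals_500k) ** 2,  # assumed definition of R²
        "fraction_stable": stable / len(evals_250k),
    }
```

Calling `stability_metrics` once per engine yields the inputs that a composite score would then combine.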

2. Threat Input Feature Analysis

Theoria17 variants use explicit threat features in their NNUE input encoding; Theoria16 and Stockfish 17 do not. This architectural difference introduces depth-dependent noise in the output signal.

Noise Distribution by Game Phase

Positions where the T17 evaluation shifts by more than 1 pawn between 250k and 500k nodes while the T16 evaluation shifts by less than 0.3 pawns:

Phase (moves) T17-Only Unstable / Total Rate
Opening (1-15) 0 / 2906 0.00%
Middlegame (16-30) 24 / 2127 1.13%
Late Middle (31-50) 19 / 868 2.19%
Endgame (51+) 9 / 206 4.37%

The T17-only unstable rate rises steadily from 0.00% in the opening to 4.37% in the endgame, which is consistent with noise tracking the density of tactical features in the input encoding.
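
The per-phase tally above can be sketched as follows. This assumes each position is available as a tuple of move number, T17 shift, and T16 shift, where "shift" is the 500k-node evaluation minus the 250k-node evaluation in pawns. That tuple layout and the function names are hypothetical. The phase boundaries and the 1.0 / 0.3 pawn thresholds are taken from the table and its caption.

```python
from collections import Counter

# Last move number of each phase, as listed in the table; later moves are endgame.
PHASES = [(15, "Opening"), (30, "Middlegame"), (50, "Late Middle")]


def phase_of(move_number):
    """Map a move number to its game-phase label."""
    for last_move, name in PHASES:
        if move_number <= last_move:
            return name
    return "Endgame"


def t17_only_unstable(t17_shift, t16_shift):
    """True when T17 moves by more than 1 pawn while T16 moves by less than 0.3."""
    return abs(t17_shift) > 1.0 and abs(t16_shift) < 0.3


def unstable_rate_by_phase(positions):
    """positions: iterable of (move_number, t17_shift, t16_shift) in pawns.

    Returns {phase: (unstable_count, total_count, rate)} for phases that occur.
    """
    totals, unstable = Counter(), Counter()
    for move_number, t17_shift, t16_shift in positions:
        phase = phase_of(move_number)
        totals[phase] += 1
        if t17_only_unstable(t17_shift, t16_shift):
            unstable[phase] += 1
    return {p: (unstable[p], totals[p], unstable[p] / totals[p]) for p in totals}
```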

3. Signal Extraction: Filtering Input Noise

Excluding positions where T17 and T16 diverge by >0.5 pawns isolates 5,781 "low-noise" positions (95.2% of dataset).
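
A minimal sketch of this filter, under stated assumptions: the report does not say which Theoria17 variant or which node count is compared against Theoria16, so the record keys `t17` and `t16` below are hypothetical placeholders for one pair of evaluations (in pawns) per position.

```python
def low_noise_positions(records, max_divergence=0.5):
    """Keep records whose T17 and T16 evaluations differ by at most max_divergence pawns."""
    return [r for r in records if abs(r["t17"] - r["t16"]) <= max_divergence]


def retained_fraction(records, max_divergence=0.5):
    """Share of the dataset that survives the low-noise filter."""
    return len(low_noise_positions(records, max_divergence)) / len(records)
```

The stability metrics are then recomputed on the retained subset only.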

Latent Space Consistency (Low-Noise Positions)

Engine Spearman ρ R²
Theoria16 0.99757 0.99748
Theoria17-3072 0.99756 0.99554
Theoria17-1024 0.99744 0.99800
Stockfish 17 0.99717 0.99685

Theoria17-1024 achieves the highest R² (0.99800) once input noise is filtered.

Cross-Model Embedding Agreement (500k nodes)

Model Pair Spearman ρ
T17-1024 vs T17-3072 0.99711
SF17 vs T17-1024 0.99582
T16 vs T17-1024 0.99579
T16 vs T17-3072 0.99579
SF17 vs T17-3072 0.99568

The two T17 variants agree with each other more closely than any other pair (ρ = 0.99711), suggesting a shared learned representation.

Gradient Correlation (Do models update in the same direction?)

Comparison Correlation
T17-1024 vs T17-3072 0.2449
Theoria family mean 0.2177
SF17 vs Theoria mean 0.1679

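Evaluation updates in the Theoria models correlate more strongly with one another than with Stockfish's, which suggests the Theoria family converges toward shared optima.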
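
The report does not define how this correlation is computed. One plausible reading, sketched below purely as an assumption, treats each engine's "update" as its 500k-node evaluation minus its 250k-node evaluation on the same position, and takes the Pearson correlation of two engines' update vectors. All function names are hypothetical.

```python
from math import sqrt


def eval_updates(evals_250k, evals_500k):
    """Per-position change in evaluation (pawns) when going from 250k to 500k nodes."""
    return [b - a for a, b in zip(evals_250k, evals_500k)]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def update_correlation(engine_a, engine_b):
    """engine_*: (evals_250k, evals_500k) tuples covering the same positions in order."""
    return pearson(eval_updates(*engine_a), eval_updates(*engine_b))
```

Under this reading, a value near 1 means two engines revise their evaluations in the same direction and by proportional amounts, while a value near 0 means their revisions are unrelated.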

4. Conclusion

Theoria17's apparent instability is input-layer noise, not latent representation degradation.

When threat-feature noise is filtered, the underlying learned embedding is at least as stable in Theoria17 as in the other engines. Threat features add input-layer complexity that requires deeper search to resolve, but the compressed latent representation remains coherent.

Revised Ranking (Latent Representation Stability)

Rank Engine Evidence
1 Theoria17-1024 Highest R² (filtered); highest cross-model agreement
2 Theoria16 Highest raw stability; no input-layer noise
3 Theoria17-3072 Strong T17 family agreement; lower filtered R² (0.99554)
4 Stockfish 17 Lowest Theoria correlation; lowest filtered ρ

5. Architectural Implications

Are explicit threat features redundant?

Theoria16's NNUE, trained on Lc0 evaluations, learns threat information implicitly through the training signal. The network encodes tactical patterns in its weights rather than receiving them as explicit input features.

Explicit threat inputs may be:

The data suggest that implicit threat encoding (Theoria16) achieves a better signal-to-noise ratio than explicit threat features (Theoria17), while the underlying learned representations remain comparable once noise is filtered.

Download PGN Files

Contains moves, evaluations, and variations

Download: games_stockfish17_250k.pgn
Download: games_stockfish17_500k.pgn
Download: games_theoria16_250k.pgn
Download: games_theoria16_500k.pgn
Download: games_theoria17-1024_250k.pgn
Download: games_theoria17-1024_500k.pgn
Download: games_theoria17-3072_250k.pgn
Download: games_theoria17-3072_500k.pgn