Engine Evaluation Stability Analysis

Comparing evaluation consistency from 250k to 500k nodes

Generated: 2026-01-25 | Dataset: 6,488 positions across 100 games

1. Raw Signal Stability Ranking

Composite score from four metrics (Spearman ρ, R², systematicity, fraction stable):

Rank	Engine	Composite Score
1	Theoria16	0.979
2	Theoria17-3072	0.510
3	Theoria17-1024	0.272
4	Stockfish 17	0.098
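
Two of the four metrics, Spearman ρ and R², can be computed directly from paired evaluations of the same positions at the two node counts. The sketch below is one way to do that, not the report's actual pipeline: it assumes R² means the squared Pearson correlation of a simple linear fit, and the function names and sample evaluations are made up for illustration. Systematicity, fraction stable, and the weighting used for the composite are not defined in this report, so they are left out.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

def r_squared(xs, ys):
    """Assumed definition: R² of a simple linear fit (squared Pearson r)."""
    return pearson(xs, ys) ** 2

# Hypothetical evaluations in pawns for the same six positions.
evals_250k = [0.31, -0.12, 1.05, 0.48, -0.77, 2.10]
evals_500k = [0.28, -0.05, 1.20, 0.51, -0.80, 1.95]

rho = spearman_rho(evals_250k, evals_500k)
r2 = r_squared(evals_250k, evals_500k)
print(f"rho={rho:.4f} R2={r2:.4f}")
```

Spearman ρ only looks at ordering, so it is insensitive to a uniform rescaling of evaluations between node counts, while R² also penalizes departures from a linear relationship. That difference is why the two columns can rank engines differently.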

2. Threat Input Feature Analysis

Theoria17 variants use explicit threat features in their NNUE input encoding; Theoria16 and Stockfish 17 do not. This architectural difference introduces depth-dependent noise in the output signal.

Noise Distribution by Game Phase

Positions where the T17 evaluation shifts by more than 1 pawn between 250k and 500k nodes while the T16 evaluation shifts by less than 0.3 pawns:

Phase (moves)	T17-Only Unstable / Total	Rate
Opening (1-15)	0 / 2906	0.00%
Middlegame (16-30)	24 / 2127	1.13%
Late Middle (31-50)	19 / 868	2.19%
Endgame (51+)	9 / 206	4.37%

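The T17-only instability rate rises monotonically with game phase, from 0.00% in the opening to 4.37% in the endgame, a trend this analysis attributes to tactical feature density in the input encoding.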
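
The per-phase counts above can be reproduced with a small classifier over per-position shifts. This is a minimal sketch under stated assumptions: a "shift" is taken to be the evaluation at 500k nodes minus the evaluation at 250k nodes, the record layout and function names are hypothetical, and the sample records are invented.

```python
from collections import Counter

def phase_of(move_number):
    """Map a move number to the phase buckets used in the table."""
    if move_number <= 15:
        return "Opening (1-15)"
    if move_number <= 30:
        return "Middlegame (16-30)"
    if move_number <= 50:
        return "Late Middle (31-50)"
    return "Endgame (51+)"

def t17_only_unstable(t17_shift, t16_shift):
    """True when T17 moves by more than 1 pawn while T16 moves by less than 0.3."""
    return abs(t17_shift) > 1.0 and abs(t16_shift) < 0.3

def unstable_rate_by_phase(records):
    """records: iterable of (move_number, t17_shift, t16_shift), shifts in pawns.

    Returns {phase: (flagged_count, total_count, rate)}.
    """
    totals, flagged = Counter(), Counter()
    for move_number, t17_shift, t16_shift in records:
        phase = phase_of(move_number)
        totals[phase] += 1
        if t17_only_unstable(t17_shift, t16_shift):
            flagged[phase] += 1
    return {p: (flagged[p], totals[p], flagged[p] / totals[p]) for p in totals}

# Hypothetical records, for illustration only.
sample = [
    (8, 0.05, 0.02),
    (22, 1.40, 0.10),
    (24, 0.20, 0.15),
    (55, -1.30, 0.25),
    (60, 1.50, 0.90),  # both engines shift, so this is not T17-only
]
rates = unstable_rate_by_phase(sample)
for phase, (hits, total, rate) in rates.items():
    print(f"{phase}: {hits} / {total} ({rate:.2%})")
```

Using absolute shifts means a swing in either direction counts as instability, which matches the table's wording but is still an interpretation.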

3. Signal Extraction: Filtering Input Noise

Excluding positions where T17 and T16 diverge by >0.5 pawns isolates 5,781 "low-noise" positions (95.2% of dataset).
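
The filter itself is a one-line comparison per position. The sketch below assumes the divergence is the absolute difference between the two engines' evaluations of the same position; which T17 variant and which node count the report used is not stated, so the inputs here are generic and the values are invented.

```python
def low_noise_indices(t17_evals, t16_evals, max_divergence=0.5):
    """Indices of positions where the engines differ by at most max_divergence pawns."""
    return [
        i
        for i, (a, b) in enumerate(zip(t17_evals, t16_evals))
        if abs(a - b) <= max_divergence
    ]

# Hypothetical evaluations in pawns for four positions.
t17 = [0.30, 1.80, -0.40, 0.10]
t16 = [0.25, 0.90, -0.55, 0.75]

kept = low_noise_indices(t17, t16)
print(kept, f"{len(kept) / len(t17):.1%} retained")
```

Returning indices rather than filtered values lets the same subset be applied to every engine's evaluations, so all engines are scored on identical positions.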

Latent Space Consistency (Low-Noise Positions)

Engine	Spearman ρ	R²
Theoria16	0.99757	0.99748
Theoria17-3072	0.99756	0.99554
Theoria17-1024	0.99744	0.99800
Stockfish 17	0.99717	0.99685

T17-1024 achieves the highest R² (0.99800) once input noise is filtered, while Theoria16 retains the highest Spearman ρ (0.99757) by a narrow margin.

Cross-Model Embedding Agreement (500k nodes)

Model Pair	Spearman ρ
T17-1024 vs T17-3072	0.99711
SF17 vs T17-1024	0.99582
T16 vs T17-1024	0.99579
T16 vs T17-3072	0.99579
SF17 vs T17-3072	0.99568

The two T17 variants show the highest rank agreement with each other (ρ = 0.99711), suggesting a shared learned representation.

Gradient Correlation (Do models update in the same direction?)

Comparison	Correlation
T17-1024 vs T17-3072	0.2449
Theoria family mean	0.2177
SF17 vs Theoria mean	0.1679

The Theoria models' updates correlate more strongly with one another (mean 0.2177) than with Stockfish 17 (0.1679), although every correlation here is weak.
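
This report does not define "gradient". One plausible reading, assumed here and not confirmed by the source, is the per-position change in evaluation from 250k to 500k nodes, with the table reporting the Pearson correlation of those changes between two engines. A minimal sketch of that reading, with hypothetical function names and invented values:

```python
from statistics import mean

def eval_deltas(evals_250k, evals_500k):
    """Per-position change in evaluation when the node budget doubles."""
    return [after - before for before, after in zip(evals_250k, evals_500k)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical evaluations in pawns for the same five positions.
engine_a_250k = [0.20, -0.50, 1.10, 0.00, 0.70]
engine_a_500k = [0.35, -0.45, 0.90, 0.10, 0.65]
engine_b_250k = [0.25, -0.40, 1.00, 0.05, 0.80]
engine_b_500k = [0.30, -0.55, 0.95, 0.20, 0.70]

deltas_a = eval_deltas(engine_a_250k, engine_a_500k)
deltas_b = eval_deltas(engine_b_250k, engine_b_500k)
update_correlation = pearson(deltas_a, deltas_b)
print(f"update correlation: {update_correlation:.4f}")
```

Under this reading, a value near 1 would mean both engines revise the same positions in the same direction with extra search, and a value near 0 would mean their revisions are unrelated.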

4. Conclusion

Theoria17's apparent instability is input-layer noise, not latent representation degradation.

When threat-feature noise is filtered:

T17-1024 achieves highest R² (0.99800)
T17 variants show highest mutual agreement
Theoria family members correlate more strongly with one another than with Stockfish

The underlying learned embedding is at least as stable in Theoria17 as in the other engines. Threat features add input-layer complexity that requires deeper search to resolve, but the compressed latent representation remains coherent.

Revised Ranking (Latent Representation Stability)

Rank	Engine	Evidence
1	Theoria17-1024	Highest R² (filtered); highest cross-model agreement
2	Theoria16	Highest raw stability; no input-layer noise
3	Theoria17-3072	Strong T17 family agreement; slightly lower R²
4	Stockfish 17	Lowest Theoria correlation; lowest filtered ρ

5. Architectural Implications

Are explicit threat features redundant?

Theoria16's NNUE, trained on Lc0 evaluations, learns threat information implicitly through the training signal. The network encodes tactical patterns in its weights rather than receiving them as explicit input features.

Explicit threat inputs may be:

Redundant encoding: Information already captured in learned weights
Noise injection: Additional input dimensions that require more search to stabilize
Compression penalty: Expanding input space rather than learning compact representations

The data suggest that implicit threat encoding (Theoria16) achieves a better signal-to-noise ratio than explicit threat features (Theoria17), while the underlying learned representations remain comparable once noise is filtered.

Download PGN Files

Contains moves, evaluations, and variations

Download: games_stockfish17_250k.pgn
Download: games_stockfish17_500k.pgn
Download: games_theoria16_250k.pgn
Download: games_theoria16_500k.pgn
Download: games_theoria17-1024_250k.pgn
Download: games_theoria17-1024_500k.pgn
Download: games_theoria17-3072_250k.pgn
Download: games_theoria17-3072_500k.pgn