# Engine Evaluation Stability Analysis

Comparing evaluation consistency from 250k-node to 500k-node searches.

Generated: 2026-01-25 | Dataset: 6,488 positions across 100 games
## 1. Raw Signal Stability Ranking
Composite score from four metrics (Spearman ρ, R², systematicity, fraction stable):
| Rank | Engine | Composite Score |
|---|---|---|
| 1 | Theoria16 | 0.979 |
| 2 | Theoria17-3072 | 0.510 |
| 3 | Theoria17-1024 | 0.272 |
| 4 | Stockfish 17 | 0.098 |
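The report does not state how the four metrics are weighted. A minimal sketch, assuming an equal-weight mean of min-max-normalized metrics (the metric values below are hypothetical, not the study's actual numbers):

```python
import numpy as np

def composite_score(metrics):
    """Equal-weight composite of per-engine stability metrics.

    `metrics` maps engine -> (spearman, r2, systematicity, frac_stable).
    Each metric is min-max normalized across engines before averaging,
    so an engine that leads on every metric scores exactly 1.0.
    """
    engines = list(metrics)
    m = np.array([metrics[e] for e in engines], dtype=float)
    lo, hi = m.min(axis=0), m.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid divide-by-zero on ties
    norm = (m - lo) / span
    return dict(zip(engines, norm.mean(axis=1)))

# Hypothetical inputs; the report does not list the raw metric values.
scores = composite_score({
    "Theoria16":      (0.998, 0.997, 0.95, 0.98),
    "Theoria17-3072": (0.990, 0.985, 0.80, 0.85),
    "Theoria17-1024": (0.992, 0.990, 0.70, 0.80),
})
```

With min-max normalization the composite is scale-free, so metrics on different ranges (a correlation vs. a fraction) contribute equally.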
## 2. Threat Input Feature Analysis

### Noise Distribution by Game Phase
Positions where T17 shifts >1 pawn but T16 shifts <0.3 pawns:
| Phase | T17-Only Unstable | Rate |
|---|---|---|
| Opening (1-15) | 0 / 2906 | 0.00% |
| Middlegame (16-30) | 24 / 2127 | 1.13% |
| Late Middle (31-50) | 19 / 868 | 2.19% |
| Endgame (51+) | 9 / 206 | 4.37% |
T17-only instability rises monotonically with game phase, consistent with increasing tactical feature density in the input encoding.
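The instability criterion above reduces to a simple per-position flag. A sketch, assuming aligned eval arrays (in pawns, from White's perspective) at both node counts; the function name and toy values are illustrative:

```python
def t17_only_unstable(t17_250k, t17_500k, t16_250k, t16_500k):
    """Flag positions where T17's eval shifts by more than 1 pawn between
    250k and 500k nodes while T16's shifts by less than 0.3 pawns."""
    return [abs(b - a) > 1.0 and abs(d - c) < 0.3
            for a, b, c, d in zip(t17_250k, t17_500k, t16_250k, t16_500k)]

# Toy example (hypothetical evals): position 1 has a 1.3-pawn T17 swing
# with only a 0.1-pawn T16 swing, so it is flagged; position 2 is stable.
flags = t17_only_unstable([0.2, 0.5], [1.5, 0.6], [0.3, 0.4], [0.4, 0.5])
```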
## 3. Signal Extraction: Filtering Input Noise
Excluding positions where T17 and T16 diverge by >0.5 pawns isolates 5,781 "low-noise" positions (89.1% of the dataset).
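The divergence filter is a one-line mask over the two engines' evals (assumed here to be in pawns on aligned positions):

```python
def low_noise_mask(t17_evals, t16_evals, threshold=0.5):
    """Keep positions where T17 and T16 agree to within `threshold` pawns."""
    return [abs(a - b) <= threshold for a, b in zip(t17_evals, t16_evals)]

# Toy example: the middle position diverges by 1.1 pawns and is dropped.
mask = low_noise_mask([0.10, 2.00, -0.30], [0.20, 0.90, -0.25])
kept = sum(mask)  # 2 of 3 positions survive the filter
```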
### Latent Space Consistency (Low-Noise Positions)
| Engine | Spearman ρ | R² |
|---|---|---|
| Theoria16 | 0.99757 | 0.99748 |
| Theoria17-3072 | 0.99756 | 0.99554 |
| Theoria17-1024 | 0.99744 | 0.99800 |
| Stockfish 17 | 0.99717 | 0.99685 |
T17-1024 achieves highest R² when input noise is filtered.
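Per-engine ρ and R² can be sketched as below. R² is taken here as the squared Pearson correlation between the 250k and 500k evals; that is an assumption, since the report does not state its regression setup. The synthetic data is purely illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

def stability(evals_250k, evals_500k):
    """Rank consistency (Spearman rho) and linear consistency (squared
    Pearson r) of one engine's evals across the two node counts."""
    rho = spearmanr(evals_250k, evals_500k).correlation
    r2 = np.corrcoef(evals_250k, evals_500k)[0, 1] ** 2
    return rho, r2

# Synthetic check: 500k evals = 250k evals plus a small deepening shift.
rng = np.random.default_rng(0)
e250 = rng.normal(size=1000)
e500 = e250 + rng.normal(scale=0.05, size=1000)
rho, r2 = stability(e250, e500)
```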
### Cross-Model Embedding Agreement (500k nodes)
| Model Pair | Spearman ρ |
|---|---|
| T17-1024 vs T17-3072 | 0.99711 |
| SF17 vs T17-1024 | 0.99582 |
| T16 vs T17-1024 | 0.99579 |
| T16 vs T17-3072 | 0.99579 |
| SF17 vs T17-3072 | 0.99568 |
T17 variants show the highest pairwise agreement, suggesting a shared learned representation.
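Given aligned per-position evals at 500k nodes, the pairwise table can be reproduced in a few lines; the model names and noise scales below are illustrative stand-ins:

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def pairwise_agreement(evals_by_model):
    """Spearman rho between each pair of models' evals on shared positions."""
    return {(a, b): spearmanr(evals_by_model[a], evals_by_model[b]).correlation
            for a, b in combinations(sorted(evals_by_model), 2)}

# Synthetic stand-in: models share a position-quality signal plus
# model-specific noise (SF17 given slightly more for illustration).
rng = np.random.default_rng(1)
base = rng.normal(size=500)
evals = {name: base + rng.normal(scale=s, size=500)
         for name, s in [("T16", 0.05), ("T17-1024", 0.05), ("SF17", 0.10)]}
agree = pairwise_agreement(evals)
```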
### Gradient Correlation (Do models update in the same direction?)
| Comparison | Correlation |
|---|---|
| T17-1024 vs T17-3072 | 0.2449 |
| Theoria family mean | 0.2177 |
| SF17 vs Theoria mean | 0.1679 |
Theoria models' update directions correlate more strongly with one another (family mean 0.2177) than with Stockfish (0.1679), suggesting convergence toward shared optima within the family.
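The report does not define "gradient correlation" precisely; one plausible reading, sketched here, is the Pearson correlation of per-position eval deltas (500k minus 250k) between two models:

```python
import numpy as np

def update_correlation(a_250k, a_500k, b_250k, b_500k):
    """Pearson correlation between two models' eval deltas (250k -> 500k).
    Positive correlation means extra search moves both evals the same way."""
    da = np.asarray(a_500k) - np.asarray(a_250k)
    db = np.asarray(b_500k) - np.asarray(b_250k)
    return float(np.corrcoef(da, db)[0, 1])

# Synthetic check: two models share part of their deepening shift
# (scale 0.2) plus larger private noise (scale 0.3), giving a
# moderate correlation comparable in magnitude to the table above.
rng = np.random.default_rng(2)
base = rng.normal(size=2000)
shared = rng.normal(scale=0.2, size=2000)
a500 = base + shared + rng.normal(scale=0.3, size=2000)
b500 = base + shared + rng.normal(scale=0.3, size=2000)
c = update_correlation(base, a500, base, b500)
```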
## 4. Conclusion
When threat-feature noise is filtered:
- T17-1024 achieves highest R² (0.99800)
- T17 variants show highest mutual agreement
- Theoria family exhibits stronger internal convergence than with Stockfish
The underlying learned embedding is at least as stable in Theoria17 as in the other engines. Threat features add input-layer complexity that requires deeper search to resolve, but the compressed latent representation remains coherent.
### Revised Ranking (Latent Representation Stability)
| Rank | Engine | Evidence |
|---|---|---|
| 1 | Theoria17-1024 | Highest R² (filtered); highest cross-model agreement |
| 2 | Theoria16 | Highest raw stability; no input-layer noise |
| 3 | Theoria17-3072 | Strong T17 family agreement; slightly lower R² |
| 4 | Stockfish 17 | Lowest Theoria correlation; lowest filtered ρ |
## 5. Architectural Implications
Theoria16's NNUE, trained on Lc0 evaluations, learns threat information implicitly through the training signal. The network encodes tactical patterns in its weights rather than receiving them as explicit input features.
Explicit threat inputs may be:
- Redundant encoding: Information already captured in learned weights
- Noise injection: Additional input dimensions that require more search to stabilize
- Compression penalty: Expanding input space rather than learning compact representations
The data suggests implicit threat encoding (Theoria16) achieves better signal-to-noise ratio than explicit threat features (Theoria17), while the underlying learned representations remain comparable when noise is filtered.
## Download PGN Files

Each file contains moves, evaluations, and variations.

- games_stockfish17_250k.pgn
- games_stockfish17_500k.pgn
- games_theoria16_250k.pgn
- games_theoria16_500k.pgn
- games_theoria17-1024_250k.pgn
- games_theoria17-1024_500k.pgn
- games_theoria17-3072_250k.pgn
- games_theoria17-3072_500k.pgn
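Eval annotations can be pulled from these files with the standard library alone. This assumes evals are embedded as `[%eval ...]` comment tags, a common PGN convention; the files' actual annotation format is not specified here:

```python
import re

# Matches centipawn evals like [%eval 0.35] and mate scores like [%eval #-3].
EVAL_RE = re.compile(r"\[%eval\s+(#?-?\d+(?:\.\d+)?)\]")

def evals_from_pgn(text):
    """Return the eval annotation strings of a PGN game, in move order."""
    return EVAL_RE.findall(text)

# Toy game text (hypothetical; not taken from the files above).
sample = '1. e4 {[%eval 0.30]} e5 {[%eval 0.25]} 2. Nf3 {[%eval 0.35]} *'
```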