Engine Evaluation Stability Analysis

A Comparative Study of Chess Engine Evaluation Consistency Across Search Depths

Abstract: This study examines the evaluation stability of seven chess engines by measuring consistency between 250,000 and 500,000 node searches across 6,488 positions from 100 games. We find that evaluation stability correlates inversely with competitive strength (Elo), suggesting a fundamental tradeoff between strategic coherence and tactical sharpness. Theoria16 achieves exceptional stability (96.4%, Grade A) but at an 80 Elo cost relative to Stockfish 17 (38.0%, Grade D+). Analysis of the Optimism parameter reveals that a small systematic bias produces more coherent evaluations than unbiased assessment, with implications for both engine design and chess pedagogy. We identify a "comprehensibility boundary" around 3620 Elo, above which engines appear to sacrifice human-interpretable strategic coherence for competitive gains.

1. Introduction

Chess engine evaluation has reached superhuman levels, yet the nature of these evaluations remains poorly understood. Do stronger engines produce more accurate assessments of positions, or do they achieve competitive success through mechanisms that resist human interpretation?

This study investigates evaluation stability—the consistency of an engine's position assessment as search depth increases. We hypothesize that engines optimized for competitive fitness may exhibit different stability profiles than engines trained on outcome-based (teleological) evaluation data.

Our analysis encompasses engines from different training paradigms: Stockfish 17 (fitness-trained via competitive selection), Theoria16 (trained on Leela Chess Zero evaluation data), and several other engines including Obsidian, Caissa, and Theoria17 variants with different NNUE architectures.

2. Why Stability Matters for Strategic Analysis

Evaluation stability—the consistency of an engine's assessment as search depth increases—has direct implications for the quality and interpretability of chess analysis.

2.1 Coherent Strategic Narratives

When an engine's evaluation shifts erratically between search depths, the implied "plan" changes with it. A position assessed as +0.3 at 250k nodes might jump to +0.8 at 500k nodes—not because the engine found a winning tactic, but because its evaluation recalibrated in ways that don't correspond to any identifiable strategic change. This makes it impossible to construct coherent narratives about why certain moves are good.

Stable evaluations, by contrast, suggest the engine has identified something real about the position that persists under deeper scrutiny. When Theoria16 assesses a position similarly at both depths, the human analyst can trust that the evaluation reflects genuine positional features rather than search artifacts.

2.2 Pedagogical Reliability

Chess learners use engine analysis to understand why moves are good or bad. If evaluations fluctuate significantly with search depth, the "lesson" changes depending on how long you let the engine think. An unstable engine might suggest one plan at low depth and a contradictory plan at high depth, confusing rather than educating the student.

Stable engines provide consistent feedback: the assessment at quick analysis depth points in the same direction as deeper analysis, allowing learners to develop reliable intuitions without requiring infinite computation time.

2.3 Strategic vs Tactical Evaluation

High systematicity (uniform corrections across positions) suggests an engine is refining a global strategic assessment—adjusting its understanding of the position as a whole. Low systematicity (scattered corrections) suggests the engine is finding position-specific tactical shots that weren't visible at lower depth.

For strategic analysis, high systematicity is preferable: it indicates the engine's evaluation is based on persistent positional features rather than tactical accidents. For tactical training, low systematicity might be acceptable, but it makes strategic interpretation unreliable.
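This distinction can be made concrete with a toy measure. The paper does not publish its systematicity formula, so the dispersion-based definition below is an assumption for illustration only: a "strategic" engine applies a near-uniform correction everywhere, while a "tactical" engine applies large corrections to a few positions.

```python
import numpy as np

def systematicity(evals_shallow, evals_deep):
    # Assumed operationalization (the exact formula is not given in the
    # text): 1 minus the spread of the depth-to-depth corrections,
    # normalized by the spread of the evaluations themselves.
    deltas = np.asarray(evals_deep) - np.asarray(evals_shallow)
    return max(0.0, 1.0 - np.std(deltas) / (np.std(evals_deep) + 1e-9))

rng = np.random.default_rng(0)
shallow = rng.normal(0.0, 1.0, 500)        # synthetic 250k-node evals, in pawns
strategic = shallow + 0.15                 # uniform correction everywhere
tactical = shallow + rng.choice([0.0, 1.5], 500, p=[0.9, 0.1])  # rare tactical shots

print(systematicity(shallow, strategic))   # maximal: perfectly uniform corrections
print(systematicity(shallow, tactical))    # much lower: scattered corrections
```

Under this toy definition, the uniform-correction engine scores near the maximum while the tactical one scores far lower, mirroring the high-versus-low systematicity contrast described above.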

2.4 The Manifold Interpretation

Chess positions can be understood as occupying a high-dimensional evaluation manifold. Stable evaluations suggest the engine has learned a smooth, coherent mapping of this manifold—positions with similar features receive similar evaluations, and small changes in search depth produce proportional changes in assessment.

Unstable evaluations suggest a fragmented manifold: the engine's assessment depends heavily on which tactical fragments it happens to discover at a given depth, producing evaluations that are locally accurate but globally incoherent. Such engines may play strong chess by "stitching together" tactical fragments through search, but their static evaluations resist strategic interpretation.

2.5 Practical Implications

For practical analysis work—annotating games, preparing openings, understanding positional themes—stability determines whether engine output is trustworthy at reasonable time controls. An engine with 96% stability (Theoria16) produces analysis at 250k nodes that largely agrees with 500k node analysis. An engine with 38% stability (Stockfish 17) may require significantly deeper search before its evaluations stabilize, and even then the strategic narrative may remain opaque.

This does not mean unstable engines are inferior for all purposes. Stockfish's tactical sharpness makes it the strongest competitive player. But for human-oriented strategic analysis, stability is a distinct and important quality that competitive strength does not guarantee.

3. Methodology

3.1 Dataset

We analyzed 6,488 positions from 100 Lichess games (Elo 1450-1550). Each position was evaluated by all engines at both 250,000 nodes and 500,000 nodes, allowing measurement of evaluation shift with increased search depth.

3.2 Engines Tested

Seven engine configurations were evaluated: Stockfish 17 (Elo 3643), Obsidian (3635), and Caissa (3623), all fitness-trained via competitive selection; Theoria16 (3563), trained on Leela Chess Zero evaluation data and run with Optimism both enabled and disabled; and two Theoria17 variants, Theoria17-3072 and Theoria17-1024 (both 3563), which share Theoria16's training data but add explicit threat inputs to their NNUE architectures.
3.3 Stability Metrics

Four metrics were computed for each engine:

- Spearman ρ: rank-order consistency between the 250k-node and 500k-node evaluations
- R²: linear scaling consistency of the deeper evaluation against the shallower one
- Systematicity: uniformity of the depth-to-depth corrections across positions
- Fraction Stable: the share of positions whose evaluation changes little with the deeper search

Each metric was normalized to 0-100%, and the four were averaged into a composite stability score.
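A minimal sketch of how these metrics and the composite could be computed. The operationalizations in the comments (dispersion-based systematicity, a 0.10-pawn stability threshold, equal weighting) are assumptions; the source names the metrics but not their exact formulas.

```python
import numpy as np
from scipy import stats

def stability_metrics(e250, e500):
    """Candidate implementations of the four stability metrics (assumed forms)."""
    e250, e500 = np.asarray(e250, float), np.asarray(e500, float)
    delta = e500 - e250                                   # evaluation shift per position
    rho, _ = stats.spearmanr(e250, e500)                  # rank-order consistency
    fit = stats.linregress(e250, e500)
    r2 = fit.rvalue ** 2                                  # linear scaling consistency
    sys_ = max(0.0, 1.0 - np.std(delta) / (np.std(e500) + 1e-9))  # correction uniformity
    frac = float(np.mean(np.abs(delta) < 0.10))           # share of near-unchanged positions
    return {"spearman": rho, "r2": r2, "systematicity": sys_, "frac_stable": frac}

def composite(m):
    # "Normalized to 0-100% and averaged"; equal weights assumed here.
    return 100.0 * float(np.mean(list(m.values())))

# Synthetic demo: deeper search applies a near-uniform +0.12-pawn correction.
rng = np.random.default_rng(1)
e250 = rng.normal(0.0, 1.0, 300)
e500 = e250 + 0.12 + rng.normal(0.0, 0.02, 300)
m = stability_metrics(e250, e500)
```

On this synthetic engine all four metrics come out high, as expected for near-uniform corrections; real engines differ mainly in the systematicity and fraction-stable terms, as Section 4.2 shows.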

4. Results

4.1 Stability Rankings

Rank Engine Score Grade Elo
1 Theoria16 (Opt=true) 96.4% A 3563
2 Obsidian 62.3% C+ 3635
3 Theoria17-3072 52.5% C 3563
4 Theoria16 (Opt=false) 45.4% C- 3563
5 Caissa 40.9% D+ 3623
6 Stockfish 17 38.0% D+ 3643
7 Theoria17-1024 37.0% D+ 3563

[Figure: horizontal bar chart of composite stability scores by engine, from Theoria16 (Opt=true) at 96.4% down to Theoria17-1024 at 37.0%; bars shade from blue (high stability) to orange (low).]

4.2 Detailed Metric Breakdown

Engine Spearman ρ R² Systematicity Frac Stable
Theoria16 (Opt=true) 0.99789 0.99123 0.3601 43.3%
Obsidian 0.99756 0.96662 0.2163 36.2%
Theoria17-3072 0.99792 0.87167 0.1227 40.3%
Theoria16 (Opt=false) 0.99668 0.89642 0.1272 45.3%
Caissa 0.99625 0.97731 0.2890 29.5%
Stockfish 17 0.99752 0.86503 0.1252 35.3%
Theoria17-1024 0.99781 0.79386 0.1177 38.1%

4.3 Stability vs Elo Correlation

Within the fitness-trained engines (Stockfish, Obsidian, Caissa), we observed strong correlations between the stability metrics and Elo, inverse for systematicity and R² but positive for fraction stable:

Metric Correlation with Elo
Systematicity r = -0.98
R² r = -0.85
Fraction Stable r = +0.86

The strong inverse correlation between systematicity and Elo suggests that evaluation flexibility—willingness to revise assessments significantly with more search—may be competitively advantageous.
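These coefficients can be reproduced from the tables in Sections 4.1 and 4.2 with scipy.stats.pearsonr. With only three engines the sample is tiny, so the values should be read as descriptive summaries rather than inferential statistics.

```python
from scipy.stats import pearsonr

# Values for the three fitness-trained engines, taken from Sections 4.1-4.2:
# Stockfish 17, Obsidian, Caissa.
elo         = [3643, 3635, 3623]
system_     = [0.1252, 0.2163, 0.2890]
r_squared   = [0.86503, 0.96662, 0.97731]
frac_stable = [0.353, 0.362, 0.295]

for name, metric in [("Systematicity", system_),
                     ("R^2", r_squared),
                     ("Fraction Stable", frac_stable)]:
    r, _ = pearsonr(elo, metric)
    print(f"{name}: r = {r:+.2f}")
# Systematicity: r = -0.98
# R^2: r = -0.85
# Fraction Stable: r = +0.86
```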

5. Key Findings

5.1 The Optimism Effect

The most striking finding involves Theoria16's Optimism parameter. With Optimism enabled (default), Theoria16 achieves exceptional stability (96.4%, Grade A). With Optimism disabled, stability drops dramatically (45.4%, Grade C-).

Setting R² Systematicity Composite
Optimism=true 0.991 0.360 96.4%
Optimism=false 0.896 0.127 45.4%

[Figure: grouped bars comparing Theoria16 with Optimism enabled (blue) vs. disabled (orange): R² 99.1% vs. 89.6%, systematicity 36.0% vs. 12.7%, fraction stable 43.3% vs. 45.3%. R² and systematicity drop sharply when Optimism is disabled.]

The Optimism parameter introduces a small systematic bias (+0.1 to +0.2 pawns), causing the engine to believe its position is slightly better than objective assessment. This bias acts as a regularizer, anchoring evaluations to a consistent reference frame and producing uniform corrections across positions.

Without this bias, evaluations scatter across the manifold's local variations. Paradoxically, a small systematic lie produces more coherent, pedagogically useful analysis than unbiased truth.

5.2 The Comprehensibility Boundary (~3620 Elo)

Analysis of the Elo-stability relationship reveals a phase transition around 3620 Elo:

Engine Elo Systematicity
Theoria16 3563 0.360
Caissa 3623 0.289
Obsidian 3635 0.216
Stockfish 17 3643 0.125

[Figure: scatter of systematicity against Elo for Theoria16, Caissa, Obsidian, and Stockfish 17, with a dashed vertical line at ~3620 Elo dividing the shaded "Coherent Zone" (below) from the "Dark Forest" (above).]

The Elo gap from Caissa (3623) to Stockfish 17 (3643) is only 20 points, yet systematicity more than halves (0.289 → 0.125). Above ~3620 Elo, engines appear to trade substantial strategic coherence for marginal competitive gains.

This suggests a boundary where evaluation must become "strategically incoherent" (to humans) to gain further competitive strength. Below this line, engines can still produce human-interpretable analysis; above it, they enter what Mikhail Tal called the "dark forest"—positions navigated by pattern recognition rather than articulable strategy.

5.3 Threat Input Architecture Effects

Theoria17 variants, which use explicit threat inputs in their NNUE architecture, show reduced stability compared to Theoria16 despite identical training data:

Engine Threat Inputs R² Composite
Theoria16 No 0.991 96.4%
Theoria17-3072 Yes 0.872 52.5%
Theoria17-1024 Yes 0.794 37.0%

Theoria17-specific instability increases with game phase (0% in openings, 4.4% in endgames), correlating with tactical complexity. This suggests threat inputs add depth-dependent noise at the encoding layer, though the underlying learned representation remains coherent when this noise is filtered.

Theoria16's NNUE learns threat information implicitly through training, achieving better signal-to-noise ratio than explicit threat encoding. This has implications for NNUE architecture design: implicit learning may produce more stable evaluations than explicit feature engineering.

5.4 R² vs Systematicity: Two Distinct Properties

Our analysis revealed that R² (linear scaling consistency) and systematicity (correction uniformity) measure different aspects of stability:

High R² with medium systematicity suggests an engine that applies larger corrections to complex positions and smaller corrections to settled positions, while maintaining consistent overall scaling. This may represent a more sophisticated form of stability than uniform correction (high systematicity) alone.

6. Discussion

6.1 The Stability-Elo Tradeoff

Our findings suggest a fundamental tradeoff between evaluation stability and competitive strength. This tradeoff may reflect the geometry of the chess evaluation manifold itself.

Below ~3620 Elo, positions that decide games lie on the "well-behaved" region of the manifold—where strategic concepts connect smoothly and evaluation generalizes. Above this threshold, competitive edges increasingly come from exploiting encoding artifacts and tactical fragments that don't compose into human-interpretable strategy.

Low systematicity may indicate tactical sharpness—the engine finds specific tactical shots in some positions (large corrections) while confirming already-resolved positions in others (no change). High systematicity suggests positional/strategic evaluation where adjustments refine a global assessment uniformly.

6.2 Implications for Chess Pedagogy

Grandmasters operate through pattern recognition trained on centuries of human chess—a corpus that lives on the smooth manifold. The ~3620+ region contains patterns absent from human game databases: positions that arise only in engine-versus-engine play, with no names, thematic labels, or pedagogical tradition.

Theoria16's profile suggests it may represent the strongest engine capable of producing consistently interpretable analysis. For chess education, this "comprehensibility ceiling" has practical implications: analysis from engines above 3620 Elo may be mathematically optimal but pedagogically limited.

6.3 Implications for Engine Design

The Optimism finding suggests that evaluation regularization—introducing small systematic biases—can dramatically improve analysis coherence without proportional competitive cost. Engine developers might consider:

- Exposing an optimism-style bias parameter as an explicit "analysis mode" that trades a small amount of playing strength for evaluation coherence
- Preferring implicit threat learning over explicit threat inputs in NNUE architectures when evaluation stability is a design goal (Section 5.3)
- Reporting stability metrics alongside Elo when an engine is positioned for analysis and pedagogy rather than competition

7. Conclusion

Evaluation stability and competitive strength appear to trade off against each other in modern chess engines. Theoria16 achieves exceptional stability (Grade A) through teleological training and optimism-based regularization, while Stockfish 17 achieves maximum Elo (3643) at the cost of evaluation coherence (Grade D+).

We identify a comprehensibility boundary around 3620 Elo, above which engines sacrifice human-interpretable strategic coherence for marginal competitive gains. This boundary may represent a phase transition in the chess evaluation manifold—the point where linear strategic logic gives way to conditional, context-dependent pattern matching.

For practical analysis, these findings suggest that the strongest engine is not always the best teacher. Theoria16's stability profile makes it more suitable for pedagogical applications despite its 80 Elo deficit, while Stockfish's tactical sharpness comes at the cost of strategic interpretability.

Future work should investigate whether the comprehensibility boundary can be extended through novel training approaches, whether human pattern recognition can be trained on the ~3620+ region through immersive exposure, and whether hybrid architectures can achieve both stability and competitive strength.

8. Grade Key

Grade Score Range Interpretation
A 93-100% Exceptional stability
B+ 80-84% Very good stability
B 73-79% Good stability
C+ 58-64% Above average stability
C 50-57% Moderate stability
C- 43-49% Below average stability
D+ 35-42% Poor stability
D 28-34% Very poor stability
F <20% Unstable
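The grade key can be expressed as a small lookup. Note that the published ranges leave 85-92%, 65-72%, and 20-27% unassigned; the hypothetical function below returns None for those gaps rather than interpolating.

```python
def grade(score):
    """Map a composite stability score (in percent) to the paper's grade key."""
    if score < 20:
        return "F"
    bands = [(93, 100, "A"), (80, 84, "B+"), (73, 79, "B"),
             (58, 64, "C+"), (50, 57, "C"), (43, 49, "C-"),
             (35, 42, "D+"), (28, 34, "D")]
    for lo, hi, letter in bands:
        if lo <= score <= hi:
            return letter
    return None  # score falls in a gap of the published key

print(grade(96.4))  # A  (Theoria16, Optimism=true)
print(grade(38.0))  # D+ (Stockfish 17)
```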

Methodology Note: This analysis was conducted using position evaluations extracted from PGN files at 250,000 and 500,000 nodes per position. Statistical analysis performed using Python with SciPy for correlation metrics. Elo ratings sourced from CCRL 40/15 rating list (January 2026). All engines tested on identical position sets derived from 100 Lichess games (Elo 1450-1550).