Engine Evaluation Stability Analysis

A Comparative Study of Chess Engine Evaluation Consistency Across Search Depths

Abstract: This study examines the evaluation stability of seven chess engines by measuring consistency between 250,000 and 500,000 node searches across 6,488 positions from 100 games. We find that evaluation stability correlates inversely with competitive strength (Elo), suggesting a fundamental tradeoff between strategic coherence and tactical sharpness. Theoria16 achieves exceptional stability (96.4%, Grade A) but at an 80 Elo cost relative to Stockfish 17 (38.0%, Grade D+). Analysis of the Optimism parameter reveals that a small systematic bias produces more coherent evaluations than unbiased assessment, with implications for both engine design and chess pedagogy. We identify a "comprehensibility boundary" around 3620 Elo, above which engines appear to sacrifice human-interpretable strategic coherence for competitive gains.

1. Introduction

Chess engine evaluation has reached superhuman levels, yet the nature of these evaluations remains poorly understood. Do stronger engines produce more accurate assessments of positions, or do they achieve competitive success through mechanisms that resist human interpretation?

This study investigates evaluation stability—the consistency of an engine's position assessment as search depth increases. We hypothesize that engines optimized for competitive fitness may exhibit different stability profiles than engines trained on outcome-based (teleological) evaluation data.

Our analysis encompasses engines from different training paradigms: Stockfish 17 (fitness-trained via competitive selection), Theoria16 (trained on Leela Chess Zero evaluation data), and several other engines including Obsidian, Caissa, and Theoria17 variants with different NNUE architectures.

2. Why Stability Matters for Strategic Analysis

Evaluation stability—the consistency of an engine's assessment as search depth increases—has direct implications for the quality and interpretability of chess analysis.

2.1 Coherent Strategic Narratives

When an engine's evaluation shifts erratically between search depths, the implied "plan" changes with it. A position assessed as +0.3 at 250k nodes might jump to +0.8 at 500k nodes—not because the engine found a winning tactic, but because its evaluation recalibrated in ways that don't correspond to any identifiable strategic change. This makes it impossible to construct coherent narratives about why certain moves are good.

Stable evaluations, by contrast, suggest the engine has identified something real about the position that persists under deeper scrutiny. When Theoria16 assesses a position similarly at both depths, the human analyst can trust that the evaluation reflects genuine positional features rather than search artifacts.

2.2 Pedagogical Reliability

Chess learners use engine analysis to understand why moves are good or bad. If evaluations fluctuate significantly with search depth, the "lesson" changes depending on how long you let the engine think. An unstable engine might suggest one plan at low depth and a contradictory plan at high depth, confusing rather than educating the student.

Stable engines provide consistent feedback: the assessment at quick analysis depth points in the same direction as deeper analysis, allowing learners to develop reliable intuitions without requiring infinite computation time.

2.3 Strategic vs Tactical Evaluation

High systematicity (uniform corrections across positions) suggests an engine is refining a global strategic assessment—adjusting its understanding of the position as a whole. Low systematicity (scattered corrections) suggests the engine is finding position-specific tactical shots that weren't visible at lower depth.

For strategic analysis, high systematicity is preferable: it indicates the engine's evaluation is based on persistent positional features rather than tactical accidents. For tactical training, low systematicity might be acceptable, but it makes strategic interpretation unreliable.

2.4 The Manifold Interpretation

Chess positions can be understood as occupying a high-dimensional evaluation manifold. Stable evaluations suggest the engine has learned a smooth, coherent mapping of this manifold—positions with similar features receive similar evaluations, and small changes in search depth produce proportional changes in assessment.

Unstable evaluations suggest a fragmented manifold: the engine's assessment depends heavily on which tactical fragments it happens to discover at a given depth, producing evaluations that are locally accurate but globally incoherent. Such engines may play strong chess by "stitching together" tactical fragments through search, but their static evaluations resist strategic interpretation.

2.5 Practical Implications

For practical analysis work—annotating games, preparing openings, understanding positional themes—stability determines whether engine output is trustworthy at reasonable time controls. An engine with 96% stability (Theoria16) produces analysis at 250k nodes that largely agrees with 500k node analysis. An engine with 38% stability (Stockfish 17) may require significantly deeper search before its evaluations stabilize, and even then the strategic narrative may remain opaque.

This does not mean unstable engines are inferior for all purposes. Stockfish's tactical sharpness makes it the strongest competitive player. But for human-oriented strategic analysis, stability is a distinct and important quality that competitive strength does not guarantee.

3. Methodology

3.1 Dataset

We analyzed 6,488 positions from 100 Lichess games (Elo 1450-1550). Each position was evaluated by all engines at both 250,000 nodes and 500,000 nodes, allowing measurement of evaluation shift with increased search depth.

3.2 Engines Tested

Stockfish 17: Reference engine, fitness-trained (Elo 3643)
Obsidian 16.0: Stockfish derivative (Elo 3635)
Caissa 1.23: Independent engine (Elo 3623)
Theoria16: Stockfish 16.1 architecture with Lc0-derived NNUE (Elo ~3563)
Theoria16 (Optimism=false): Same engine with optimism parameter disabled
Theoria17-1024: Stockfish 17 base with threat-input NNUE (L1=1024)
Theoria17-3072: Stockfish 17 base with threat-input NNUE (L1=3072)

3.3 Stability Metrics

Four metrics were computed for each engine:

Spearman ρ: Rank-order correlation between 250k and 500k evaluations. Scale-invariant measure of whether positions maintain relative ordering.
R² (Coefficient of Determination): Linear consistency of evaluation scaling. Higher R² indicates evaluations scale proportionally between search depths.
Systematicity: Ratio of mean absolute shift to standard deviation of shifts. Higher values indicate uniform corrections; lower values indicate scattered, position-specific adjustments.
Fraction Stable: Proportion of positions where evaluation changed by less than 5 centipawns.

Metrics were normalized to 0-100% and averaged to produce a composite stability score.

4. Results

4.1 Stability Rankings

Rank	Engine	Score	Grade	Elo
1	Theoria16 (Opt=true)	96.4%	A	3563
2	Obsidian	62.3%	C+	3635
3	Theoria17-3072	52.5%	C	3563
4	Theoria16 (Opt=false)	45.4%	C-	3563
5	Caissa	40.9%	D+	3623
6	Stockfish 17	38.0%	D+	3643
7	Theoria17-1024	37.0%	D+	3563

Horizontal bar chart showing engine stability scores: Theoria16 with Optimism enabled leads with 96.4% (Grade A), followed by Obsidian at 62.3% (C+), Theoria17-3072 at 52.5% (C), Theoria16 with Optimism disabled at 45.4% (C-), Caissa at 40.9% (D+), Stockfish 17 at 38.0% (D+), and Theoria17-1024 at 37.0% (D+). Bar colors gradient from blue (high stability) to orange (low stability).

4.2 Detailed Metric Breakdown

Engine	Spearman ρ	R²	Systematicity	Frac Stable
Theoria16 (Opt=true)	0.99789	0.99123	0.3601	43.3%
Obsidian	0.99756	0.96662	0.2163	36.2%
Theoria17-3072	0.99792	0.87167	0.1227	40.3%
Theoria16 (Opt=false)	0.99668	0.89642	0.1272	45.3%
Caissa	0.99625	0.97731	0.2890	29.5%
Stockfish 17	0.99752	0.86503	0.1252	35.3%
Theoria17-1024	0.99781	0.79386	0.1177	38.1%

4.3 Stability vs Elo Correlation

Within fitness-trained engines (Stockfish, Obsidian, Caissa), we observed strong inverse correlations between stability metrics and Elo:

Metric	Correlation with Elo
Systematicity	r = -0.98
R²	r = -0.85
Fraction Stable	r = +0.86

The strong inverse correlation between systematicity and Elo suggests that evaluation flexibility—willingness to revise assessments significantly with more search—may be competitively advantageous.

5. Key Findings

5.1 The Optimism Effect

The most striking finding involves Theoria16's Optimism parameter. With Optimism enabled (default), Theoria16 achieves exceptional stability (96.4%, Grade A). With Optimism disabled, stability drops dramatically (45.4%, Grade C-).

Setting	R²	Systematicity	Composite
Optimism=true	0.991	0.360	96.4%
Optimism=false	0.896	0.127	45.4%

The Optimism parameter introduces a small systematic bias (+0.1 to +0.2 pawns), causing the engine to believe its position is slightly better than objective assessment. This bias acts as a regularizer, anchoring evaluations to a consistent reference frame and producing uniform corrections across positions.

Without this bias, evaluations scatter across the manifold's local variations. Paradoxically, a small systematic lie produces more coherent, pedagogically useful analysis than unbiased truth.

5.2 The Comprehensibility Boundary (~3620 Elo)

Analysis of the Elo-stability relationship reveals a phase transition around 3620 Elo:

Engine	Elo	Systematicity
Theoria16	3563	0.360
Caissa	3623	0.289
Obsidian	3635	0.216
Stockfish 17	3643	0.125

The Elo gap from Caissa (3623) to Stockfish (3643) is only 20 points, but systematicity nearly halves (0.289 → 0.125). Above ~3620 Elo, engines appear to trade substantial strategic coherence for marginal competitive gains.

This suggests a boundary where evaluation must become "strategically incoherent" (to humans) to gain further competitive strength. Below this line, engines can still produce human-interpretable analysis; above it, they enter what Mikhail Tal called the "dark forest"—positions navigated by pattern recognition rather than articulable strategy.

5.3 Threat Input Architecture Effects

Theoria17 variants, which use explicit threat inputs in their NNUE architecture, show reduced stability compared to Theoria16 despite identical training data:

Engine	Threat Inputs	R²	Composite
Theoria16	No	0.991	96.4%
Theoria17-3072	Yes	0.872	52.5%
Theoria17-1024	Yes	0.794	37.0%

Theoria17-specific instability increases with game phase (0% in openings, 4.4% in endgames), correlating with tactical complexity. This suggests threat inputs add depth-dependent noise at the encoding layer, though the underlying learned representation remains coherent when this noise is filtered.

Theoria16's NNUE learns threat information implicitly through training, achieving better signal-to-noise ratio than explicit threat encoding. This has implications for NNUE architecture design: implicit learning may produce more stable evaluations than explicit feature engineering.

5.4 R² vs Systematicity: Two Distinct Properties

Our analysis revealed that R² (linear scaling consistency) and systematicity (correction uniformity) measure different aspects of stability:

Caissa: High R² (0.977) but medium systematicity (0.289)—corrections vary in size but scale proportionally
Obsidian: High R² (0.967) with medium systematicity (0.216)—similar profile to Caissa
Stockfish: Low R² (0.865) and low systematicity (0.125)—scattered corrections that don't scale consistently

High R² with medium systematicity suggests an engine that applies larger corrections to complex positions and smaller corrections to settled positions, while maintaining consistent overall scaling. This may represent a more sophisticated form of stability than uniform correction (high systematicity) alone.

6. Discussion

6.1 The Stability-Elo Tradeoff

Our findings suggest a fundamental tradeoff between evaluation stability and competitive strength. This tradeoff may reflect the geometry of the chess evaluation manifold itself.

Below ~3620 Elo, positions that decide games lie on the "well-behaved" region of the manifold—where strategic concepts connect smoothly and evaluation generalizes. Above this threshold, competitive edges increasingly come from exploiting encoding artifacts and tactical fragments that don't compose into human-interpretable strategy.

Low systematicity may indicate tactical sharpness—the engine finds specific tactical shots in some positions (large corrections) while confirming already-resolved positions in others (no change). High systematicity suggests positional/strategic evaluation where adjustments refine a global assessment uniformly.

6.2 Implications for Chess Pedagogy

Grandmasters operate through pattern recognition trained on centuries of human chess—a corpus that lives on the smooth manifold. The ~3620+ region contains patterns absent from human game databases: positions that arise only in engine-versus-engine play, with no names, thematic labels, or pedagogical tradition.

Theoria16's profile suggests it may represent the strongest engine capable of producing consistently interpretable analysis. For chess education, this "comprehensibility ceiling" has practical implications: analysis from engines above 3620 Elo may be mathematically optimal but pedagogically limited.

6.3 Implications for Engine Design

The Optimism finding suggests that evaluation regularization—introducing small systematic biases—can dramatically improve analysis coherence without proportional competitive cost. Engine developers might consider:

Tunable optimism parameters for analysis versus competitive play
Implicit rather than explicit tactical feature encoding
Architecture choices that preserve manifold smoothness

7. Conclusion

Evaluation stability and competitive strength appear to trade off against each other in modern chess engines. Theoria16 achieves exceptional stability (Grade A) through teleological training and optimism-based regularization, while Stockfish 17 achieves maximum Elo (3643) at the cost of evaluation coherence (Grade D+).

We identify a comprehensibility boundary around 3620 Elo, above which engines sacrifice human-interpretable strategic coherence for marginal competitive gains. This boundary may represent a phase transition in the chess evaluation manifold—the point where linear strategic logic gives way to conditional, context-dependent pattern matching.

For practical analysis, these findings suggest that the strongest engine is not always the best teacher. Theoria16's stability profile makes it more suitable for pedagogical applications despite its 80 Elo deficit, while Stockfish's tactical sharpness comes at the cost of strategic interpretability.

Future work should investigate whether the comprehensibility boundary can be extended through novel training approaches, whether human pattern recognition can be trained on the ~3620+ region through immersive exposure, and whether hybrid architectures can achieve both stability and competitive strength.

8. Grade Key

Grade	Score Range	Interpretation
A	93-100%	Exceptional stability
B+	80-84%	Very good stability
B	73-79%	Good stability
C+	58-64%	Above average stability
C	50-57%	Moderate stability
C-	43-49%	Below average stability
D+	35-42%	Poor stability
D	28-34%	Very poor stability
F	<20%	Unstable

Reference Data

Methodology Note: This analysis was conducted using position evaluations extracted from PGN files at 250,000 and 500,000 nodes per position. Statistical analysis performed using Python with SciPy for correlation metrics. Elo ratings sourced from CCRL 40/15 rating list (January 2026). All engines tested on identical position sets derived from 100 Lichess games (Elo 1450-1550).