Measuring What Matters: A Framework for EVM Decompiler Accuracy Metrics
Claims of “95% accuracy” or “best-in-class recovery” appear in EVM decompiler documentation without any shared definition of what’s being measured. Without agreed-upon metrics, you can’t compare tools, detect regressions, or know which shortcomings are worth fixing first.
This post proposes a framework for evaluating EVM decompiler output across three dimensions: structural fidelity, semantic preservation, and security analysis utility.
Why Current Evaluation Approaches Fall Short
When a decompiler transforms bytecode into readable code, there are three common ways people evaluate whether it did a good job — and all three have problems.
Manual inspection. Experts read the output and decide if it looks right. This doesn’t scale, it’s subjective, and it reliably misses subtle errors that leave no visible trace in the output.
Recompilation testing. Decompile, recompile the output, and compare the bytecodes. This fails because Solidity output varies across compiler versions and optimization settings, and different valid source can produce identical bytecode. You can’t use the compiler as a correctness oracle.
Round-trip testing. A variant of recompilation testing, with the same problems. Compilation is lossy; it papers over errors in the decompiled source before the comparison happens.
A useful metric needs to capture what actually matters for the use case — which, for security-focused decompilation, is whether the output supports correct vulnerability detection.
Dimension 1: Structural Fidelity
Structural fidelity asks whether the decompiled output reflects the organization of the original code.
Function Boundary Accuracy
Did the decompiler correctly identify where each function begins and ends?
Ground truth (from verified source):
Function A: bytes 0x00-0x4f
Function B: bytes 0x50-0x9f
Function C: bytes 0xa0-0xff
Decompiler output:
Function A: bytes 0x00-0x4f ✓
Function B: bytes 0x50-0x8f ✗ truncated
Function C: bytes 0x90-0xff ✗ wrong start
Metrics:
Function count accuracy: 3/3 = 100%
Boundary precision: 1/3 = 33%
Byte coverage (bytes attributed to the correct function): 240/256 = 94%
Formal definition:
FBA = |correctly bounded functions| / |total functions|
A function is "correctly bounded" when:
- Start address matches a dispatcher target
- End address includes all reachable basic blocks
- No overlap with other functions
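The boundary metrics above can be sketched as follows. The `Function` type and the ground-truth data are hypothetical stand-ins; a real harness would pull boundaries from the dispatcher analysis and the verified source build. Byte coverage here counts bytes attributed to the *correct* function, matching the 240/256 figure above, and assumes functions are pre-aligned by dispatcher selector (represented by list position).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Function:
    start: int  # first byte offset
    end: int    # last byte offset, inclusive

def boundary_precision(truth: list, output: list) -> float:
    """Fraction of ground-truth functions whose boundaries match exactly."""
    matched = sum(1 for f in truth if f in output)
    return matched / len(truth)

def byte_coverage(truth: list, output: list) -> float:
    """Fraction of bytes attributed to the correct function.

    Assumes truth[i] and output[i] refer to the same function (e.g. matched
    by dispatcher selector)."""
    correct, total = 0, 0
    for t, o in zip(truth, output):
        t_bytes = set(range(t.start, t.end + 1))
        o_bytes = set(range(o.start, o.end + 1))
        correct += len(t_bytes & o_bytes)
        total += len(t_bytes)
    return correct / total

truth = [Function(0x00, 0x4F), Function(0x50, 0x9F), Function(0xA0, 0xFF)]
output = [Function(0x00, 0x4F), Function(0x50, 0x8F), Function(0x90, 0xFF)]
print(boundary_precision(truth, output))  # 1/3 ≈ 0.33
print(byte_coverage(truth, output))       # 240/256 = 0.9375
```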
Control Flow Graph Preservation
Does the decompiled CFG match the bytecode-level CFG?
Original CFG:
Block A → Block B (conditional)
Block A → Block C (fallthrough)
Block B → Block D
Block C → Block D (merge)
Decompiled code:
if (condition) { /* Block B */ }
else { /* Block C */ }
/* Block D */
Scores:
Nodes: 4/4
Edges: 4/4
Structural isomorphism: yes
A weighted metric:
CFG_Score = α * (matched_nodes / total_nodes)
+ β * (matched_edges / total_edges)
+ γ * isomorphism_bonus
α + β + γ = 1
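A minimal sketch of this weighted score, with graphs as plain node and edge sets. The α/β/γ values are illustrative, and matching nodes by label assumes the two graphs have been pre-aligned; real comparison needs a graph-matching step first.

```python
def cfg_score(orig_nodes, orig_edges, dec_nodes, dec_edges,
              isomorphic, alpha=0.4, beta=0.4, gamma=0.2):
    """Weighted CFG preservation score. Node labels are assumed pre-aligned."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    node_ratio = len(orig_nodes & dec_nodes) / len(orig_nodes)
    edge_ratio = len(orig_edges & dec_edges) / len(orig_edges)
    return alpha * node_ratio + beta * edge_ratio + gamma * (1.0 if isomorphic else 0.0)

orig_nodes = {"A", "B", "C", "D"}
orig_edges = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
# The if/else structure in the example recovers all nodes and edges:
print(cfg_score(orig_nodes, orig_edges, orig_nodes, orig_edges, isomorphic=True))  # 1.0
```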
Loop Recovery
Original: while (i < n) { body; i++; }
Decompiler A: while (i < n) { body; i++; } → 1.0
Decompiler B: for (i; i < n; i++) { body; } → 0.9 (restructured but equivalent)
Decompiler C: loop: if (i >= n) goto end; body; i++; goto loop; end: → 0.3 (loop present, not recovered)
do-while loops and loops with complex break/continue are harder than while loops for most tools; scores tend to drop there first.
Storage Layout Recovery
Original:
slot 0: owner (address)
slot 1: totalSupply (uint256)
slot 2: balances (mapping(address => uint256))
Decompiler output:
slot 0: var_0 (address) ✓ type correct
slot 1: var_1 (uint256) ✓ type correct
slot 2: mapping_2 (mapping(?=>?)) △ mapping detected, key/value types unknown
Scores:
Slot detection: 3/3 = 100%
Type recovery: 2.5/3 = 83%
Name recovery: 0/3 = 0% (no source available)
Name recovery is expected to be zero without verified source; it’s not a useful signal on its own.
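The 2.5/3 type-recovery figure comes from partial credit: a fully recovered type scores 1.0, while a mapping whose container is identified but whose key/value types are unknown scores 0.5. The scoring values are an assumed convention, not a standard.

```python
def type_score(truth_type: str, recovered: str) -> float:
    """Partial-credit type scoring; 0.5 for container-only recovery is assumed."""
    if recovered == truth_type:
        return 1.0
    # Container recognized (e.g. "mapping"), key/value types unknown.
    if truth_type.startswith("mapping") and recovered.startswith("mapping"):
        return 0.5
    return 0.0

slots = [("address", "address"),
         ("uint256", "uint256"),
         ("mapping(address => uint256)", "mapping(?=>?)")]
total = sum(type_score(t, r) for t, r in slots)
print(f"{total}/{len(slots)}")  # 2.5/3
```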
Dimension 2: Semantic Preservation
Semantic preservation asks whether decompiled code behaves identically to the original — same inputs produce same outputs and same state transitions.
Input-Output Equivalence
For each function F:
Generate a test input set I = {i1, i2, ..., in}
For each input i:
result_orig = EVM.execute(bytecode, i)
result_decomp = compiler.compile(decompiled).execute(i)
pass = (result_orig == result_decomp)
IO_Equivalence = passed_tests / total_tests
Because you can’t exhaustively test all inputs, the confidence interval depends on how much of the input space you’ve covered — boundary values (0, 1, MAX_UINT), random samples, and symbolically derived inputs that reach each branch.
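The pseudocode above can be made concrete as a small harness. `execute_original` and `execute_decompiled` are stand-ins for whatever EVM execution backend you use (a forked node, a local EVM implementation); they are not real APIs, and the boundary-value generator is deliberately minimal.

```python
MAX_UINT = 2**256 - 1

def boundary_inputs():
    """Minimal boundary-value set; real suites add random and symbolic inputs."""
    return [0, 1, MAX_UINT]

def io_equivalence(inputs, execute_original, execute_decompiled) -> float:
    """Fraction of inputs on which both executions agree."""
    passed = sum(1 for i in inputs
                 if execute_original(i) == execute_decompiled(i))
    return passed / len(inputs)

# With a semantics-preserving decompiler, both executions agree on every input:
identity = lambda x: x
print(io_equivalence(boundary_inputs(), identity, identity))  # 1.0
```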
Gas Consumption Preservation
Decompiled code won’t have identical gas cost (the compiler may emit different opcode sequences), but the profile should be close:
Gas preservation score = 1 - |gas_original - gas_decompiled| / gas_original
A difference under 10% is generally acceptable for security analysis purposes — you’re checking that the structural transformation didn’t add significant computation, not that the output is gas-optimal.
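The formula as code. Note that the raw score goes negative once decompiled gas more than doubles the original; clamping at zero is a reasonable (assumed) convention.

```python
def gas_preservation(gas_original: int, gas_decompiled: int) -> float:
    """1 - relative gas difference, clamped at zero."""
    score = 1 - abs(gas_original - gas_decompiled) / gas_original
    return max(score, 0.0)

print(gas_preservation(21000, 22500))  # ≈0.93, within the 10% tolerance
print(gas_preservation(100, 300))      # 0.0 (clamped; raw score would be -1)
```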
State Transition Equivalence
Test: transfer(alice, 100)
Expected state transitions:
balances[sender]: 1000 → 900
balances[alice]: 0 → 100
Decompiled version must produce:
- Same slots modified
- Same final values
- Same write ordering (ordering matters for reentrancy analysis)
State_Equivalence = matching_transitions / total_transitions
Write ordering is often overlooked but is critical when you’re analyzing reentrancy: an analysis that can’t distinguish “write then call” from “call then write” will produce incorrect results.
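A sketch of the comparison, with transitions as `(slot, before, after)` tuples in write order. Comparing position-by-position (rather than as sets) is what makes the metric ordering-sensitive.

```python
def state_equivalence(expected, observed) -> float:
    """Fraction of transitions that match in both content and position."""
    matches = sum(1 for e, o in zip(expected, observed) if e == o)
    return matches / len(expected)

expected = [("balances[sender]", 1000, 900), ("balances[alice]", 0, 100)]
# A decompiler that swapped the writes scores 0 here, even though the
# final state is identical — exactly the distinction reentrancy needs:
swapped = list(reversed(expected))
print(state_equivalence(expected, expected))  # 1.0
print(state_equivalence(expected, swapped))   # 0.0
```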
Event Emission Equivalence
Original: emit Transfer(sender, recipient, amount)
Check:
- Same LOG opcode with same topics
- Same indexed parameter values
- Same non-indexed data
- Same position in the execution trace
Event_Score = matching_events / total_events
Dimension 3: Security Analysis Utility
This is the dimension that matters most for security-focused decompilation. Can the tool detect the same vulnerabilities in decompiled output that it would find in verified source?
Vulnerability Detection Preservation
Ground truth (from verified source):
- Reentrancy in withdraw() [HIGH]
- Unchecked return in transfer() [MEDIUM]
- Centralization risk in admin() [INFO]
Decompiled analysis results:
- Reentrancy detected ✓ (true positive)
- Unchecked return detected ✓ (true positive)
- Centralization risk detected ✓ (true positive)
- Integer overflow reported ✗ (false positive — contract uses 0.8+)
Precision: 3/4 = 75%
Recall: 3/3 = 100%
F1: 86%
The false positive here is a common category: compilers since Solidity 0.8 emit overflow checks that a bytecode analyzer must recognize as such, or it reports overflow on every addition.
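The scores above follow from the standard definitions: 3 true positives, 1 false positive (the spurious overflow report), 0 false negatives.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard detection metrics from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=0)
print(f"precision={p:.0%} recall={r:.0%} f1={f1:.0%}")
# precision=75% recall=100% f1=86%
```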
Taint Analysis Accuracy
Expected taint path:
CALLDATALOAD → MUL → ADD → SSTORE(slot_x)
Decompiled path:
input = _calldata[4:36]
computed = input * price + fee
storage[slot_x] = computed
Check: taint origin and sink preserved through the transformation
Taint_Accuracy = correctly_traced_paths / total_taint_paths
Pattern Matching Effectiveness
Reentrancy in bytecode: SLOAD → CALL → SSTORE
The corresponding decompiled pattern:
uint bal = balances[user];
user.call{value: bal}("");
balances[user] = 0;
A decompiler that preserves this ordering should produce output where the same reentrancy detector fires. One that reorders statements as part of “simplification” will cause false negatives.
Pattern_Score = true_matches / (true_matches + false_matches)
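A toy version of the ordering-sensitive detector: it fires only when a storage write follows an external call that itself follows a storage read. This sketch ignores slot tracking and call targets, which a real detector must handle.

```python
def has_reentrancy_pattern(ops: list) -> bool:
    """Detect the SLOAD → CALL → SSTORE ordering in an opcode sequence.

    Toy detector: does not track which slot is read/written."""
    state = 0  # 0: awaiting SLOAD, 1: awaiting CALL, 2: awaiting SSTORE
    for op in ops:
        if state == 0 and op == "SLOAD":
            state = 1
        elif state == 1 and op == "CALL":
            state = 2
        elif state == 2 and op == "SSTORE":
            return True
    return False

print(has_reentrancy_pattern(["SLOAD", "CALL", "SSTORE"]))  # True
# A decompiler that reorders the write before the call hides the pattern:
print(has_reentrancy_pattern(["SLOAD", "SSTORE", "CALL"]))  # False
```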
A Benchmark Suite
For these metrics to be useful, they need a standard corpus. A reasonable benchmark design:
| Category | Count | Purpose |
|---|---|---|
| Standard contracts (ERC20, ERC721, simple DeFi) | 100 | Baseline structural and semantic scores |
| Complex DeFi (lending, AMMs, yield aggregators) | 50 | Complex control flow, storage patterns |
| Adversarial contracts (obfuscated, unusual compiler output) | 50 | Robustness under adversarial conditions |
| Known-vulnerable historical contracts | 30 | Vulnerability detection recall |
| Multi-contract systems (proxies, diamonds) | 20 | Proxy resolution, cross-contract patterns |
Ground truth sources, in decreasing reliability:
- Verified source code from Etherscan — compile with original settings and verify the bytecode hash matches
- Manual annotation by experts — function boundaries, security issues
- Multi-decompiler consensus — where multiple tools agree, treat as ground truth (but handle carefully; systematic errors across tools will look like consensus)
Composite Score
S = w1 * Structural + w2 * Semantic + w3 * Security
Structural = 0.3 * FBA + 0.3 * CFG + 0.2 * Loop + 0.2 * Storage
Semantic = 0.4 * IO + 0.2 * Gas + 0.3 * State + 0.1 * Events
Security = 0.3 * VulnDetect + 0.3 * Taint + 0.2 * Pattern + 0.2 * Invariant
Suggested weights:
w1 = 0.25 (structural)
w2 = 0.25 (semantic)
w3 = 0.50 (security — most important for security-focused tools)
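The composite formula as code, with the suggested weights as defaults. All sub-scores are placeholders in [0, 1]; the `invariant` term appears in the Security formula but is not broken out in a section above, so it is taken here as an invariant-preservation check.

```python
def composite(fba, cfg, loop, storage,
              io, gas, state, events,
              vuln, taint, pattern, invariant,
              w1=0.25, w2=0.25, w3=0.50):
    """Weighted composite of the three dimensions; weights per the text."""
    structural = 0.3 * fba + 0.3 * cfg + 0.2 * loop + 0.2 * storage
    semantic = 0.4 * io + 0.2 * gas + 0.3 * state + 0.1 * events
    security = 0.3 * vuln + 0.3 * taint + 0.2 * pattern + 0.2 * invariant
    return w1 * structural + w2 * semantic + w3 * security

# Weights sum to 1 within each tier, so a perfect tool scores 1.0:
print(composite(*([1.0] * 12)))  # 1.0
```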
Known Hard Cases
Any benchmark should explicitly call out the categories where tools typically struggle:
Loop recovery. do-while loops and loops with multiple break/continue sites are harder to recover than simple while loops. Failure here typically produces working but less readable output.
Obfuscated contracts. Contracts compiled with optimizer settings designed to produce unusual code, or written in hand-rolled assembly, cause significant score drops across all dimensions for most tools. This is an active research area.
Cross-contract invariants. Properties that span multiple contracts (e.g., a lending protocol’s invariant that total borrows never exceed total collateral) require reading state from several contracts at once, which most tools don’t fully support.
Layer 2 bytecode. Optimism, Arbitrum, and other L2s have EVM-compatible but not EVM-identical execution environments. Bytecode compiled for L2 deployment may include L2-specific precompile calls or opcodes. The benchmark should be extended to cover these.
“Accuracy” is a marketing term until you define what you’re measuring. The framework here — structural fidelity, semantic preservation, security utility — gives a three-dimensional view that’s useful for both tool development and tool selection. A decompiler might score well on structural fidelity (correct function boundaries, correct CFG) while having poor semantic preservation (reorders statements), which would show up immediately in the security utility scores even if manual inspection looked fine.
For security work, the F1 score on vulnerability detection is the number that matters most. Everything else is in service of that.