Measuring What Matters: A Framework for EVM Decompiler Accuracy Metrics
Claims of “95% accuracy” or “best-in-class recovery” appear in EVM decompiler documentation without any shared definition of what’s being measured. Without agreed-upon metrics, you can’t compare tools, detect regressions, or know which shortcomings are worth fixing first.
This post proposes a framework for evaluating EVM decompiler output across three dimensions: structural fidelity, semantic preservation, and security analysis utility.
Why Current Evaluation Approaches Fall Short
When a decompiler transforms bytecode into readable code, there are three common ways people evaluate whether it did a good job — and all three have problems.
Manual inspection. Experts read the output and decide if it looks right. This doesn’t scale, it’s subjective, and it reliably misses subtle errors that leave no visible trace in the output.
Recompilation testing. Decompile, recompile the output, and compare the bytecodes. This fails because Solidity output varies across compiler versions and optimization settings, and different valid source can produce identical bytecode. You can’t use the compiler as a correctness oracle.
Round-trip testing. A variant of recompilation testing, with the same problems. Compilation is lossy; it papers over errors in the decompiled source before the comparison happens.
A useful metric needs to capture what actually matters for the use case — which, for security-focused decompilation, is whether the output supports correct vulnerability detection.
Dimension 1: Structural Fidelity
Structural fidelity asks whether the decompiled output reflects the organization of the original code.
Function Boundary Accuracy
Did the decompiler correctly identify where each function begins and ends?
Ground truth (from verified source):
Function A: bytes 0x00-0x4f
Function B: bytes 0x50-0x9f
Function C: bytes 0xa0-0xff
Decompiler output:
Function A: bytes 0x00-0x4f ✓
Function B: bytes 0x50-0x8f ✗ truncated
Function C: bytes 0x90-0xff ✗ wrong start
Metrics:
Function count accuracy: 3/3 = 100%
Boundary precision: 1/3 = 33%
Byte coverage (bytes attributed to the correct function): 240/256 = 94%
Formal definition:
FBA = |correctly bounded functions| / |total functions|
A function is "correctly bounded" when:
- Start address matches a dispatcher target
- End address includes all reachable basic blocks
- No overlap with other functions
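The boundary metrics above can be sketched as follows. The `Function` type and the ground-truth data are hypothetical stand-ins; a real harness would pull boundaries from the dispatcher analysis and the verified source build. Byte coverage here counts bytes attributed to the *correct* function, matching the 240/256 figure above, and assumes functions are pre-aligned by dispatcher selector (represented by list position).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Function:
    start: int  # first byte offset
    end: int    # last byte offset, inclusive

def boundary_precision(truth: list, output: list) -> float:
    """Fraction of ground-truth functions whose boundaries match exactly."""
    matched = sum(1 for f in truth if f in output)
    return matched / len(truth)

def byte_coverage(truth: list, output: list) -> float:
    """Fraction of bytes attributed to the correct function.

    Assumes truth[i] and output[i] refer to the same function (e.g. matched
    by dispatcher selector)."""
    correct, total = 0, 0
    for t, o in zip(truth, output):
        t_bytes = set(range(t.start, t.end + 1))
        o_bytes = set(range(o.start, o.end + 1))
        correct += len(t_bytes & o_bytes)
        total += len(t_bytes)
    return correct / total

truth = [Function(0x00, 0x4F), Function(0x50, 0x9F), Function(0xA0, 0xFF)]
output = [Function(0x00, 0x4F), Function(0x50, 0x8F), Function(0x90, 0xFF)]
print(boundary_precision(truth, output))  # 1/3 ≈ 0.33
print(byte_coverage(truth, output))       # 240/256 = 0.9375
```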
Control Flow Graph Preservation
Does the decompiled CFG match the bytecode-level CFG?
Original CFG:
Block A → Block B (conditional)
Block A → Block C (fallthrough)
Block B → Block D
Block C → Block D (merge)
Decompiled code:
if (condition) { /* Block B */ }
else { /* Block C */ }
/* Block D */
Scores:
Nodes: 4/4
Edges: 4/4
Structural isomorphism: yes
A weighted metric:
CFG_Score = α * (matched_nodes / total_nodes)
+ β * (matched_edges / total_edges)
+ γ * isomorphism_bonus
α + β + γ = 1
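A minimal sketch of this weighted score, with graphs as plain node and edge sets. The α/β/γ values are illustrative, and matching nodes by label assumes the two graphs have been pre-aligned; real comparison needs a graph-matching step first.

```python
def cfg_score(orig_nodes, orig_edges, dec_nodes, dec_edges,
              isomorphic, alpha=0.4, beta=0.4, gamma=0.2):
    """Weighted CFG preservation score. Node labels are assumed pre-aligned."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    node_ratio = len(orig_nodes & dec_nodes) / len(orig_nodes)
    edge_ratio = len(orig_edges & dec_edges) / len(orig_edges)
    return alpha * node_ratio + beta * edge_ratio + gamma * (1.0 if isomorphic else 0.0)

orig_nodes = {"A", "B", "C", "D"}
orig_edges = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
# The if/else structure in the example recovers all nodes and edges:
print(cfg_score(orig_nodes, orig_edges, orig_nodes, orig_edges, isomorphic=True))  # 1.0
```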
Loop Recovery
Original: while (i < n) { body; i++; }
Decompiler A: while (i < n) { body; i++; } → 1.0
Decompiler B: for (i; i < n; i++) { body; } → 0.9 (restructured but equivalent)
Decompiler C: loop: if (i >= n) goto end; body; i++; goto loop; end: → 0.3 (loop present, not recovered)
do-while loops and loops with complex break/continue are harder than while loops for most tools; scores tend to drop there first.
Storage Layout Recovery
Original:
slot 0: owner (address)
slot 1: totalSupply (uint256)
slot 2: balances (mapping(address => uint256))
Decompiler output:
slot 0: var_0 (address) ✓ type correct
slot 1: var_1 (uint256) ✓ type correct
slot 2: mapping_2 (mapping(?=>?)) △ mapping detected, key/value types unknown
Scores:
Slot detection: 3/3 = 100%
Type recovery: 2.5/3 = 83%
Name recovery: 0/3 = 0% (no source available)
Name recovery is expected to be zero without verified source; it’s not a useful signal on its own.
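The 2.5/3 type-recovery figure comes from partial credit: a fully recovered type scores 1.0, while a mapping whose container is identified but whose key/value types are unknown scores 0.5. The scoring values are an assumed convention, not a standard.

```python
def type_score(truth_type: str, recovered: str) -> float:
    """Partial-credit type scoring; 0.5 for container-only recovery is assumed."""
    if recovered == truth_type:
        return 1.0
    # Container recognized (e.g. "mapping"), key/value types unknown.
    if truth_type.startswith("mapping") and recovered.startswith("mapping"):
        return 0.5
    return 0.0

slots = [("address", "address"),
         ("uint256", "uint256"),
         ("mapping(address => uint256)", "mapping(?=>?)")]
total = sum(type_score(t, r) for t, r in slots)
print(f"{total}/{len(slots)}")  # 2.5/3
```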
Dimension 2: Semantic Preservation
Semantic preservation asks whether decompiled code behaves identically to the original — same inputs produce same outputs and same state transitions.
Input-Output Equivalence
For each function F:
Generate a test input set I = {i1, i2, ..., in}
For each input i:
result_orig = EVM.execute(bytecode, i)
result_decomp = compiler.compile(decompiled).execute(i)
pass = (result_orig == result_decomp)
IO_Equivalence = passed_tests / total_tests
Because you can’t exhaustively test all inputs, the confidence interval depends on how much of the input space you’ve covered — boundary values (0, 1, MAX_UINT), random samples, and symbolically derived inputs that reach each branch.
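The pseudocode above can be made concrete as a small harness. `execute_original` and `execute_decompiled` are stand-ins for whatever EVM execution backend you use (a forked node, a local EVM implementation); they are not real APIs, and the boundary-value generator is deliberately minimal.

```python
MAX_UINT = 2**256 - 1

def boundary_inputs():
    """Minimal boundary-value set; real suites add random and symbolic inputs."""
    return [0, 1, MAX_UINT]

def io_equivalence(inputs, execute_original, execute_decompiled) -> float:
    """Fraction of inputs on which both executions agree."""
    passed = sum(1 for i in inputs
                 if execute_original(i) == execute_decompiled(i))
    return passed / len(inputs)

# With a semantics-preserving decompiler, both executions agree on every input:
identity = lambda x: x
print(io_equivalence(boundary_inputs(), identity, identity))  # 1.0
```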
Gas Consumption Preservation
Decompiled code won’t have identical gas cost (the compiler may emit different opcode sequences), but the profile should be close:
Gas preservation score = 1 - |gas_original - gas_decompiled| / gas_original
A difference under 10% is generally acceptable for security analysis purposes — you’re checking that the structural transformation didn’t add significant computation, not that the output is gas-optimal.
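The formula as code. Note that the raw score goes negative once decompiled gas more than doubles the original; clamping at zero is a reasonable (assumed) convention.

```python
def gas_preservation(gas_original: int, gas_decompiled: int) -> float:
    """1 - relative gas difference, clamped at zero."""
    score = 1 - abs(gas_original - gas_decompiled) / gas_original
    return max(score, 0.0)

print(gas_preservation(21000, 22500))  # ≈0.93, within the 10% tolerance
print(gas_preservation(100, 300))      # 0.0 (clamped; raw score would be -1)
```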
State Transition Equivalence
Test: transfer(alice, 100)
Expected state transitions:
balances[sender]: 1000 → 900
balances[alice]: 0 → 100
Decompiled version must produce:
- Same slots modified
- Same final values
- Same write ordering (ordering matters for reentrancy analysis)
State_Equivalence = matching_transitions / total_transitions
Write ordering is often overlooked but is critical when you’re analyzing reentrancy: an analysis that can’t distinguish “write then call” from “call then write” will produce incorrect results.
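A sketch of the comparison, with transitions as `(slot, before, after)` tuples in write order. Comparing position-by-position (rather than as sets) is what makes the metric ordering-sensitive.

```python
def state_equivalence(expected, observed) -> float:
    """Fraction of transitions that match in both content and position."""
    matches = sum(1 for e, o in zip(expected, observed) if e == o)
    return matches / len(expected)

expected = [("balances[sender]", 1000, 900), ("balances[alice]", 0, 100)]
# A decompiler that swapped the writes scores 0 here, even though the
# final state is identical — exactly the distinction reentrancy needs:
swapped = list(reversed(expected))
print(state_equivalence(expected, expected))  # 1.0
print(state_equivalence(expected, swapped))   # 0.0
```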
Event Emission Equivalence
Original: emit Transfer(sender, recipient, amount)
Check:
- Same LOG opcode with same topics
- Same indexed parameter values
- Same non-indexed data
- Same position in the execution trace
Event_Score = matching_events / total_events
Dimension 3: Security Analysis Utility
This is the dimension that matters most for security-focused decompilation. Can the tool detect the same vulnerabilities in decompiled output that it would find in verified source?
Vulnerability Detection Preservation
Ground truth (from verified source):
- Reentrancy in withdraw() [HIGH]
- Unchecked return in transfer() [MEDIUM]
- Centralization risk in admin() [INFO]
Decompiled analysis results:
- Reentrancy detected ✓ (true positive)
- Unchecked return detected ✓ (true positive)
- Centralization risk detected ✓ (true positive)
- Integer overflow reported ✗ (false positive — contract uses 0.8+)
Precision: 3/4 = 75%
Recall: 3/3 = 100%
F1: 86%
The false positive here is a common category: compilers since Solidity 0.8 emit overflow checks that a bytecode analyzer must recognize as such, or it reports overflow on every addition.
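The scores above follow from the standard definitions: 3 true positives, 1 false positive (the spurious overflow report), 0 false negatives.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard detection metrics from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=0)
print(f"precision={p:.0%} recall={r:.0%} f1={f1:.0%}")
# precision=75% recall=100% f1=86%
```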
Taint Analysis Accuracy
Expected taint path:
CALLDATALOAD → MUL → ADD → SSTORE(slot_x)
Decompiled path:
input = _calldata[4:36]
computed = input * price + fee
storage[slot_x] = computed
Check: taint origin and sink preserved through the transformation
Taint_Accuracy = correctly_traced_paths / total_taint_paths
Pattern Matching Effectiveness
Reentrancy in bytecode: SLOAD → CALL → SSTORE
The corresponding decompiled pattern:
uint bal = balances[user];
user.call{value: bal}("");
balances[user] = 0;
A decompiler that preserves this ordering should produce output where the same reentrancy detector fires. One that reorders statements as part of “simplification” will cause false negatives.
Pattern_Score = true_matches / (true_matches + false_matches)
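A toy version of the ordering-sensitive detector: it fires only when a storage write follows an external call that itself follows a storage read. This sketch ignores slot tracking and call targets, which a real detector must handle.

```python
def has_reentrancy_pattern(ops: list) -> bool:
    """Detect the SLOAD → CALL → SSTORE ordering in an opcode sequence.

    Toy detector: does not track which slot is read/written."""
    state = 0  # 0: awaiting SLOAD, 1: awaiting CALL, 2: awaiting SSTORE
    for op in ops:
        if state == 0 and op == "SLOAD":
            state = 1
        elif state == 1 and op == "CALL":
            state = 2
        elif state == 2 and op == "SSTORE":
            return True
    return False

print(has_reentrancy_pattern(["SLOAD", "CALL", "SSTORE"]))  # True
# A decompiler that reorders the write before the call hides the pattern:
print(has_reentrancy_pattern(["SLOAD", "SSTORE", "CALL"]))  # False
```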
A Benchmark Suite
For these metrics to be useful, they need a standard corpus. A reasonable benchmark design:
| Category | Count | Purpose |
|---|---|---|
| Standard contracts (ERC20, ERC721, simple DeFi) | 100 | Baseline structural and semantic scores |
| Complex DeFi (lending, AMMs, yield aggregators) | 50 | Complex control flow, storage patterns |
| Adversarial contracts (obfuscated, unusual compiler output) | 50 | Robustness under adversarial conditions |
| Known-vulnerable historical contracts | 30 | Vulnerability detection recall |
| Multi-contract systems (proxies, diamonds) | 20 | Proxy resolution, cross-contract patterns |
Ground truth sources, in decreasing reliability:
- Verified source code from Etherscan — compile with original settings and verify the bytecode hash matches
- Manual annotation by experts — function boundaries, security issues
- Multi-decompiler consensus — where multiple tools agree, treat as ground truth (but handle carefully; systematic errors across tools will look like consensus)
Composite Score
S = w1 * Structural + w2 * Semantic + w3 * Security
Structural = 0.3 * FBA + 0.3 * CFG + 0.2 * Loop + 0.2 * Storage
Semantic = 0.4 * IO + 0.2 * Gas + 0.3 * State + 0.1 * Events
Security = 0.3 * VulnDetect + 0.3 * Taint + 0.2 * Pattern + 0.2 * Invariant
Suggested weights:
w1 = 0.25 (structural)
w2 = 0.25 (semantic)
w3 = 0.50 (security — most important for security-focused tools)
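The composite formula as code, with the suggested weights as defaults. All sub-scores are placeholders in [0, 1]; the `invariant` term appears in the Security formula but is not broken out in a section above, so it is taken here as an invariant-preservation check.

```python
def composite(fba, cfg, loop, storage,
              io, gas, state, events,
              vuln, taint, pattern, invariant,
              w1=0.25, w2=0.25, w3=0.50):
    """Weighted composite of the three dimensions; weights per the text."""
    structural = 0.3 * fba + 0.3 * cfg + 0.2 * loop + 0.2 * storage
    semantic = 0.4 * io + 0.2 * gas + 0.3 * state + 0.1 * events
    security = 0.3 * vuln + 0.3 * taint + 0.2 * pattern + 0.2 * invariant
    return w1 * structural + w2 * semantic + w3 * security

# Weights sum to 1 within each tier, so a perfect tool scores 1.0:
print(composite(*([1.0] * 12)))  # 1.0
```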
Known Hard Cases
Any benchmark should explicitly call out the categories where tools typically struggle:
Loop recovery. do-while loops and loops with multiple break/continue sites are harder to recover than simple while loops. Failure here typically produces working but less readable output.
Obfuscated contracts. Contracts compiled with optimizer settings designed to produce unusual code, or written in hand-rolled assembly, cause significant score drops across all dimensions for most tools. This is an active research area.
Cross-contract invariants. Properties that span multiple contracts (e.g., a lending protocol’s invariant that total borrows never exceed total collateral) require reading state from several contracts at once, which most tools don’t fully support.
Layer 2 bytecode. Optimism, Arbitrum, and other L2s have EVM-compatible but not EVM-identical execution environments. Bytecode compiled for L2 deployment may include L2-specific precompile calls or opcodes. The benchmark should be extended to cover these.
“Accuracy” is a marketing term until you define what you’re measuring. The framework here — structural fidelity, semantic preservation, security utility — gives a three-dimensional view that’s useful for both tool development and tool selection. A decompiler might score well on structural fidelity (correct function boundaries, correct CFG) while having poor semantic preservation (reorders statements), which would show up immediately in the security utility scores even if manual inspection looked fine.
For security work, the F1 score on vulnerability detection is the number that matters most. Everything else is in service of that.