Bytecode Decompilation Pipeline
EVM Bytecode to Readable Solidity
A technical walkthrough of the multi-stage EVM decompilation pipeline: how raw bytecode is parsed, lifted through two intermediate representations, and emitted as readable Solidity or Yul.
Bytecode Decompilation Pipeline
Most analysis tools require Solidity source—a static analyzer that operates on ASTs, a symbolic executor that maps source locations to bytecode, a human auditor reading a diff. The problem is that a large fraction of deployed contract bytecode has no corresponding verified source on Etherscan or any other verification registry. Estimates vary, but analyses of Ethereum mainnet have consistently found that fewer than half of unique contract deployments have verified source available. That gap includes intentionally private protocols, contracts from defunct teams, factory-deployed clones, and proxy implementations that were never separately verified.
Decompilation closes that gap by recovering a Solidity-equivalent representation from the bytecode itself. This document describes how that reconstruction works in practice.
What Decompilation Can and Cannot Recover
Before walking through the stages, it helps to be clear about what compilation to bytecode destroys:
Permanently lost: variable names, function names not in a known selector database, inline comments, original formatting, intermediate expression names.
Recoverable with high confidence: control flow structure, storage layout, function boundaries, type information (from usage patterns), function signatures (for most public functions, where the 4-byte selector appears in known signature databases), event signatures (from LOG topics), proxy relationships.
Recoverable with lower confidence: loop structure (vs. equivalent tail recursion), whether a pattern is a modifier vs. inlined code, precise intent of complex assembly blocks.
The goal is not to produce the original source. It is to produce an equivalent, readable representation that a human can audit and a tool can analyze. Those are different targets, and the pipeline is designed for both.
Why Two Intermediate Representations?
A single IR forces a choice: either stay close to the EVM execution model (accurate, but hard to read) or push toward Solidity (readable, but lossy for analysis). The pipeline uses two representations rather than one, with different passes targeting each.
LIR (Low-Level IR) stays close to the EVM. Stack operations are converted to named temporaries and JUMP targets are resolved into a control flow graph, but the representation is still statement-by-statement and type-free. Passes that need to reason accurately about execution order—particularly CFG construction and function boundary detection—work at this level.
HIR (High-Level IR) lifts the representation to something closer to a structured language. Variables have inferred types, storage accesses are named, stack slots are replaced with SSA variables, and common patterns like conditional reverts are recognized as require(). Analysis passes and the backend both target HIR.
Stage 1: Frontend (Bytecode → LIR)
Input: raw bytecode hex string.
The frontend does four things in sequence:
Parsing: validates the bytecode format, extracts embedded metadata (compiler version, IPFS hash if present), and separates the constructor from the deployed code.
Disassembly: converts bytes to opcodes, resolves PUSH immediate values, and identifies basic block boundaries—sequences of instructions that always execute together without branching.
CFG construction: identifies all JUMP and JUMPI targets (which requires resolving what value is on the stack at each jump site, itself a small analysis problem), links basic blocks into a control flow graph, and resolves function dispatcher jump tables to identify which selector maps to which block.
Function detection: the dispatcher pattern (a sequence of CALLDATALOAD, PUSH4, EQ, JUMPI blocks) identifies function entry points and their 4-byte selectors. The fallback and receive functions are identified by the absence of a selector comparison. Internal functions are detected by calling convention—jump-to rather than selector-dispatch.
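The selector-matching step can be sketched as a linear scan over the disassembly. The following is an illustrative Python sketch, not the pipeline's actual code: the `(opcode, operand)` instruction encoding and all names are assumptions, and real dispatchers vary with compiler version and optimizer settings.

```python
# Sketch of dispatcher recognition over a disassembled instruction list.
# Instructions are (opcode, operand) pairs; operand is None for non-PUSH ops.
# The encoding and names here are illustrative, not the pipeline's API.

def extract_selectors(instructions):
    """Map each 4-byte selector to its JUMPI target in the dispatcher."""
    selectors = {}
    for i in range(len(instructions) - 3):
        op0, arg0 = instructions[i]
        op1, _ = instructions[i + 1]
        op2, arg2 = instructions[i + 2]
        op3, _ = instructions[i + 3]
        # Pattern: PUSH4 <selector>; EQ; PUSH2 <dest>; JUMPI
        if (op0, op1, op2, op3) == ("PUSH4", "EQ", "PUSH2", "JUMPI"):
            selectors[arg0] = arg2
    return selectors

disasm = [
    ("PUSH1", 0x00), ("CALLDATALOAD", None), ("PUSH1", 0xE0), ("SHR", None),
    ("DUP1", None), ("PUSH4", 0xA9059CBB), ("EQ", None),
    ("PUSH2", 0x004D), ("JUMPI", None),      # transfer(address,uint256)
    ("DUP1", None), ("PUSH4", 0x70A08231), ("EQ", None),
    ("PUSH2", 0x0091), ("JUMPI", None),      # balanceOf(address)
]
print({hex(k): hex(v) for k, v in extract_selectors(disasm).items()})
```

A production matcher has to tolerate the variations between solc versions (the `SHR`-based selector extraction above is the 0.8.x shape; older compilers use `DIV` by `2**224`), which is why the real pipeline matches at the LIR level rather than on raw opcodes.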
Output example:
Bytecode fragment:

```
60 80 60 40 52 34 80 15 61 00 0f 57 60 00 80 fd
```

Disassembly:

```
PUSH1 0x80
PUSH1 0x40
MSTORE
CALLVALUE
DUP1
ISZERO
PUSH2 0x000f
JUMPI
PUSH1 0x00
DUP1
REVERT
```

LIR (after CFG construction):

```
Block 0:
  MSTORE(0x40, 0x80)
  t1 = CALLVALUE
  t2 = ISZERO(t1)
  JUMPI(Block 2, Block 1, t2)
Block 1:
  REVERT(0x00, 0x00)
Block 2:
  [continues...]
```
Stage 2: HIR Transformation (LIR → HIR)
This stage does the heavy lifting: it assigns types, names storage variables, promotes patterns to high-level constructs, and converts the CFG to SSA form.
SSA Conversion
Static Single Assignment form rewrites every variable so that each is assigned exactly once. Where two control flow paths produce different values for the same variable, a φ (phi) node merges them. SSA enables the analysis passes in Stage 3 to reason about data flow without tracking aliasing.
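Concretely, a branch that assigns the same value along two paths looks like this after conversion. The notation follows the LIR/HIR examples elsewhere in this document; block names and variables are illustrative:

```
Block A:
  t_1 = CALLDATALOAD(0x04)
  JUMPI(Block B, Block C, cond)
Block B:
  t_2 = ADD(t_1, 1)
  JUMP(Block C)
Block C:                      // join point
  t_3 = φ(t_1, t_2)           // t_1 if A fell through, t_2 if B ran
  RETURN(t_3)
```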
Type Inference
Types are inferred from how values are used, not from annotations (which don’t exist in bytecode). The inference rules are straightforward:
| Usage Pattern | Inferred Type |
|---|---|
| Used as a `CALL` target or compared to `msg.sender` | `address` |
| Result of `EQ`, `LT`, `GT`, `ISZERO`; used as branch condition | `bool` |
| Passed to `KECCAK256`; used as event topic | `bytes32` |
| Used with signed arithmetic (`SDIV`, `SMOD`, `SLT`, `SGT`) | `int256` |
| Used with unsigned arithmetic; token balance patterns | `uint256` |
| Read/written byte-by-byte with masking | `bytes` variants |
Types propagate forward through data flow: if x is typed as address, then y = x inherits that type.
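The table rules plus forward propagation fit a standard fixed-point loop. This is a minimal sketch under assumed data structures (each SSA assignment as a `(dest, opcode, operands)` tuple, a synthetic `COPY` opcode for `y = x`); none of these names come from the pipeline itself:

```python
# Sketch of usage-based type inference over SSA assignments, iterated to a
# fixed point so that copies made before their source is typed still resolve.
# Rules mirror the table above; all names here are illustrative.

OPCODE_TYPES = {
    "EQ": "bool", "LT": "bool", "GT": "bool", "ISZERO": "bool",
    "SDIV": "int256", "SLT": "int256",
    "CALLER": "address", "ADD": "uint256", "SUB": "uint256",
}

def infer_types(assignments):
    types = {}
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for dest, op, operands in assignments:
            if op == "COPY":             # y = x inherits x's type
                t = types.get(operands[0])
            else:
                t = OPCODE_TYPES.get(op)
            if t and types.get(dest) != t:
                types[dest] = t
                changed = True
    return types

prog = [
    ("t0", "CALLER", []),                # address
    ("t1", "EQ", ["t0", "t9"]),          # bool
    ("t2", "COPY", ["t0"]),              # inherits address
]
print(infer_types(prog))
```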
Storage Analysis
Every SLOAD and SSTORE is catalogued by slot. Three patterns are common:
- Fixed slot: `SLOAD(0x00)` — a scalar state variable. Named by usage context (e.g., compared to `msg.sender` → `owner`).
- Mapping access: `SLOAD(KECCAK256(key, base_slot))` — recognized by the keccak pattern; the base slot identifies the mapping variable, the key is the lookup parameter.
- Array access: `SLOAD(ADD(base_slot, index))` — sequential slots starting from a base (the fixed-size array layout; dynamic arrays store elements at `keccak256(base_slot) + index`).
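The classification above can be sketched as pattern matching on the slot-address expression. Here expressions are nested tuples; this representation and the function name are assumptions for illustration, not the pipeline's actual structures:

```python
# Sketch of classifying an SLOAD address expression into the three common
# storage access patterns. Expressions are nested tuples such as
# ("KECCAK256", key, slot); the encoding is illustrative only.

def classify_slot_expr(expr):
    if isinstance(expr, int):
        return ("fixed", expr)                 # scalar state variable
    op = expr[0]
    if op == "KECCAK256":
        key, base = expr[1], expr[2]
        return ("mapping", base, key)          # mapping[key] at keccak(key, base)
    if op == "ADD" and isinstance(expr[1], int):
        return ("array", expr[1], expr[2])     # base_slot + index
    return ("unknown", expr)

print(classify_slot_expr(0x00))
print(classify_slot_expr(("KECCAK256", "caller", 0x01)))
print(classify_slot_expr(("ADD", 0x02, "i")))
```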
Reconstructed storage layout example:
```
Slot 0x00: written in constructor, read in modifier, type=address → "owner"
Slot 0x01 (mapping): keccak256(user_address, 0x01), type=uint256 → "mapping(address => uint256) balances"
Slot 0x02: sequential reads at 0x02 + offset, type=address[] → "address[] participants"
```
Semantic Lifting
Repeated opcode sequences that implement Solidity constructs are recognized and replaced with their high-level equivalent:
- `ISZERO` + `JUMPI` + `REVERT` with a string in returndata → `require(condition, "message")`
- `INVALID` after a condition check → `assert(condition)`
- `CALL` with a `value` parameter → candidate for `.transfer()` or `.call{value: ...}()`
- `LOG3` with a known topic hash → `emit Transfer(...)` (after event signature lookup)
OpenZeppelin’s standard contract patterns (Ownable, ReentrancyGuard, AccessControl, Pausable) all produce characteristic bytecode sequences that are matched at this stage and annotated.
Transformation example:
LIR:

```
t0 = CALLER
t1 = SLOAD(0x00)
t2 = EQ(t0, t1)
t3 = ISZERO(t2)
JUMPI(revert_block, continue_block, t3)
```

HIR:

```
owner: address = storage[0x00]   // named from usage
require(msg.sender == owner, "Not owner")
```
Stage 3: Optimization Passes
Eight passes run on the HIR to improve readability without changing semantics:
- Constant folding: evaluates expressions with all-constant operands. `keccak256("Transfer(address,address,uint256)")` → `0xddf252ad...`
- Constant propagation: substitutes known constants for variables.
- Dead code elimination: removes unreachable blocks and unused SSA definitions.
- Common subexpression elimination: deduplicates expressions computed more than once.
- Copy propagation: eliminates trivial copies (`x = y; z = x` → `z = y`).
- Variable renaming: replaces SSA temporaries with semantic names inferred from usage context and the function signature database.
- Control flow simplification: merges empty blocks, eliminates trivially-true conditions, converts dispatcher jump tables to switch-equivalent structure.
- Loop reconstruction: recognizes back-edges and classifies them as `for`, `while`, or `do-while` based on where the condition check appears relative to the loop body.
Before/after example:
Before:

```
v_0 = 100
v_1 = 200
v_2 = ADD(v_0, v_1)
v_3 = v_2
v_4 = ADD(v_2, 0)
if (false) { REVERT }
RETURN(v_3)
```

After:

```
RETURN(300)
```
Stage 4: Code Generation
The backend consumes optimized HIR and emits Solidity (or Yul for ambiguous cases).
Solidity backend generates a complete contract structure: state variable declarations ordered by slot, function signatures with inferred visibility and type annotations, events reconstructed from LOG operations, and require/revert statements with extracted error strings where present.
Example output for a simple withdrawal contract:
```solidity
// SPDX-License-Identifier: UNLICENSED
pragma solidity ^0.8.0;

contract Decompiled {
    address public owner;                          // Slot 0
    mapping(address => uint256) public balances;   // Slot 1

    constructor() {
        owner = msg.sender;
    }

    modifier onlyOwner() {
        require(msg.sender == owner, "Not owner");
        _;
    }

    function withdraw(uint256 amount) external {
        require(balances[msg.sender] >= amount, "Insufficient balance");
        balances[msg.sender] -= amount;
        payable(msg.sender).transfer(amount);
    }

    function setOwner(address newOwner) external onlyOwner {
        require(newOwner != address(0), "Invalid address");
        owner = newOwner;
    }
}
```
Yul backend is used when Solidity reconstruction is ambiguous—inline assembly blocks that use arbitrary control flow, optimized dispatcher patterns that don’t map cleanly to Solidity switch statements, or cases where the Solidity type system cannot express what the bytecode does. Yul preserves exact semantics at the cost of reduced readability.
Proxy Handling
Proxy contracts are worth special treatment because the interesting logic lives in the implementation, not the proxy itself. The pipeline detects proxy patterns by storage slot:
- EIP-1967 transparent/UUPS: implementation address at `0x360894a13ba1a3210667c828492db98dca3e2076cc3735a920a3ca505d382bbc`
- EIP-1167 minimal proxy: 45-byte clone bytecode with the implementation address embedded directly
- EIP-2535 diamond: facet registry accessed at `0xc8fcad8db84d3cc18b4c41d551ea0ee66dd599cde068d998e57d5e09332c131c`
When a proxy is detected, the pipeline fetches the implementation’s bytecode, decompiles it independently, and produces a unified view that shows the proxy’s storage layout alongside the implementation’s function logic. Delegatecall boundaries are annotated in the output.
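The EIP-1167 case is the simplest to detect, since the clone's runtime bytecode is fully determined: a fixed 10-byte prefix, the 20-byte implementation address, and a fixed 15-byte suffix. A minimal sketch (the function name is illustrative; the address used in the example is a placeholder):

```python
# Sketch of EIP-1167 minimal-proxy detection. The 45-byte runtime bytecode is
# a fixed prefix, a 20-byte implementation address, and a fixed suffix.

PREFIX = bytes.fromhex("363d3d373d3d3d363d73")
SUFFIX = bytes.fromhex("5af43d82803e903d91602b57fd5bf3")

def eip1167_implementation(runtime_code: bytes):
    """Return the embedded implementation address, or None if not a clone."""
    if len(runtime_code) != 45:
        return None
    if not (runtime_code.startswith(PREFIX) and runtime_code.endswith(SUFFIX)):
        return None
    return "0x" + runtime_code[10:30].hex()

# Placeholder implementation address, not a real deployment.
clone = PREFIX + bytes.fromhex("11" * 20) + SUFFIX
print(eip1167_implementation(clone))
```

EIP-1967 and EIP-2535 detection work differently: they require reading the well-known storage slots from chain state rather than inspecting the bytecode alone.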
Accuracy Benchmarks
The pipeline was evaluated against a corpus of approximately 10,000 contracts with verified source on Etherscan, where the original source serves as ground truth. Each metric below compares the decompiled output against the verified source, measured as of early 2026. Results vary with contract complexity, optimization settings, and proxy depth:
| Metric | Result |
|---|---|
| Functional equivalence (same behavior) | 98.7% |
| Control flow accuracy (CFG matches source-compiled CFG) | 99.2% |
| Type inference accuracy | 94.3% |
| Storage layout accuracy | 91.8% |
| Function signature match (known functions only) | 87.5% |
These numbers reflect contracts compiled with standard Solidity optimizer settings. Contracts with aggressive optimization (assembly-heavy, loop-unrolled) score lower on storage layout accuracy; contracts with heavy proxy nesting score lower on type accuracy.
Performance on a recent server-class machine:
| Contract size | Time | Memory |
|---|---|---|
| < 5 KB | < 0.5 s | ~10 MB |
| 5–20 KB | 0.5–2 s | ~50 MB |
| 20–50 KB | 2–5 s | ~150 MB |
| > 50 KB | 5–15 s | ~300 MB |
Known Limitations
Jump tables with computed targets: Some function dispatchers use computed jumps (JUMP where the target is derived from calldata) rather than linear selector comparisons. Static CFG construction cannot resolve these targets in all cases. Where symbolic execution is needed to resolve jump targets, the pipeline falls back to preserving the assembly block in Yul.
Compiler-version sensitivity: Solidity’s code generation has changed significantly between versions. Patterns that identify require() differ between solc 0.4.x and 0.8.x. The pipeline detects the compiler version from embedded bytecode metadata where available, but contracts compiled without metadata (or with metadata stripped) require heuristic version detection.
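Locating the metadata itself is mechanical: solc appends a CBOR blob to the runtime code and encodes its length, big-endian, in the final two bytes. A minimal sketch of the split (the function name is illustrative, and the CBOR bytes in the example are fabricated for shape, not real solc output; a robust implementation would also validate that the blob actually parses as CBOR, since a random trailer can satisfy the length check by accident):

```python
# Sketch of splitting solc's trailing metadata from deployed code. The last
# two bytes encode the CBOR blob's length big-endian; the blob sits just
# before them.

def split_metadata(code: bytes):
    """Return (runtime_code, cbor_metadata); metadata is empty if absent."""
    if len(code) < 2:
        return code, b""
    n = int.from_bytes(code[-2:], "big")
    if n == 0 or n + 2 > len(code):
        return code, b""                 # no plausible metadata trailer
    return code[: -(n + 2)], code[-(n + 2):-2]

body = bytes.fromhex("6080604052")
cbor = b"\xa1\x64solc\x43\x00\x08\x13"   # illustrative bytes, not real solc output
code = body + cbor + len(cbor).to_bytes(2, "big")
print(split_metadata(code))
```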
Optimizer-heavy output: Contracts compiled with aggressive optimization settings, particularly with inlining and common subexpression hoisting, produce bytecode that does not correspond to any natural Solidity structure. The decompiled output for these contracts is correct but less readable—more Yul-like even in the Solidity backend.