Bytecode Decompilation Pipeline
EVM Bytecode to Readable Solidity
A technical walkthrough of the multi-stage EVM decompilation pipeline: how raw bytecode is parsed, lifted through two intermediate representations, and emitted as readable Solidity or Yul.
Bytecode Decompilation Pipeline
Most analysis tools require Solidity source—a static analyzer that operates on ASTs, a symbolic executor that maps source locations to bytecode, a human auditor reading a diff. The problem is that a large fraction of deployed contract bytecode has no corresponding verified source on Etherscan or any other verification registry. Estimates vary, but analyses of Ethereum mainnet have consistently found that fewer than half of unique contract deployments have verified source available. That gap includes intentionally private protocols, contracts from defunct teams, factory-deployed clones, and proxy implementations that were never separately verified.
Decompilation closes that gap by recovering a Solidity-equivalent representation from the bytecode itself. This document describes how that reconstruction works in practice.
What Decompilation Can and Cannot Recover
Before walking through the stages, it helps to be clear about what compilation to bytecode destroys:
Permanently lost: variable names, function names not in a known selector database, inline comments, original formatting, intermediate expression names.
Recoverable with high confidence: control flow structure, storage layout, function boundaries, type information (from usage patterns), function signatures (for most public functions, where the 4-byte selector appears in known signature databases), event signatures (from LOG topics), proxy relationships.
Recoverable with lower confidence: loop structure (vs. equivalent tail recursion), whether a pattern is a modifier vs. inlined code, precise intent of complex assembly blocks.
The goal is not to produce the original source. It is to produce an equivalent, readable representation that a human can audit and a tool can analyze. Those are different targets, and the pipeline is designed for both.
Why Two Intermediate Representations?
A single IR forces a choice: either stay close to the EVM execution model (accurate, but hard to read) or push toward Solidity (readable, but lossy for analysis). The pipeline uses two representations rather than one, with different passes targeting each.
LIR (Low-Level IR) stays close to the EVM. Stack operations are converted to named temporaries and JUMP targets are resolved into a control flow graph, but the representation is still statement-by-statement and type-free. Passes that need to reason accurately about execution order—particularly CFG construction and function boundary detection—work at this level.
HIR (High-Level IR) lifts the representation to something closer to a structured language. Variables have inferred types, storage accesses are named, stack slots are replaced with SSA variables, and common patterns like conditional reverts are recognized as require(). Analysis passes and the backend both target HIR.
Stage 1: Frontend (Bytecode → LIR)
Input: raw bytecode hex string.
The frontend does four things in sequence:
Parsing: validates the bytecode format, extracts embedded metadata (compiler version, IPFS hash if present), and separates the constructor from the deployed code.
Disassembly: converts bytes to opcodes, resolves PUSH immediate values, and identifies basic block boundaries—sequences of instructions that always execute together without branching.
CFG construction: identifies all JUMP and JUMPI targets (which requires resolving what value is on the stack at each jump site, itself a small analysis problem), links basic blocks into a control flow graph, and resolves function dispatcher jump tables to identify which selector maps to which block.
Function detection: the dispatcher pattern (a sequence of CALLDATALOAD, PUSH4, EQ, JUMPI blocks) identifies function entry points and their 4-byte selectors. The fallback and receive functions are identified by the absence of a selector comparison. Internal functions are detected by calling convention—jump-to rather than selector-dispatch.
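The selector-matching step can be sketched as a linear scan over the disassembly. The following is an illustrative Python sketch, not the pipeline's actual code: the `(opcode, operand)` instruction encoding and all names are assumptions, and real dispatchers vary with compiler version and optimizer settings.

```python
# Sketch of dispatcher recognition over a disassembled instruction list.
# Instructions are (opcode, operand) pairs; operand is None for non-PUSH ops.
# The encoding and names here are illustrative, not the pipeline's API.

def extract_selectors(instructions):
    """Map each 4-byte selector to its JUMPI target in the dispatcher."""
    selectors = {}
    for i in range(len(instructions) - 3):
        op0, arg0 = instructions[i]
        op1, _ = instructions[i + 1]
        op2, arg2 = instructions[i + 2]
        op3, _ = instructions[i + 3]
        # Pattern: PUSH4 <selector>; EQ; PUSH2 <dest>; JUMPI
        if (op0, op1, op2, op3) == ("PUSH4", "EQ", "PUSH2", "JUMPI"):
            selectors[arg0] = arg2
    return selectors

disasm = [
    ("PUSH1", 0x00), ("CALLDATALOAD", None), ("PUSH1", 0xE0), ("SHR", None),
    ("DUP1", None), ("PUSH4", 0xA9059CBB), ("EQ", None),
    ("PUSH2", 0x004D), ("JUMPI", None),      # transfer(address,uint256)
    ("DUP1", None), ("PUSH4", 0x70A08231), ("EQ", None),
    ("PUSH2", 0x0091), ("JUMPI", None),      # balanceOf(address)
]
print({hex(k): hex(v) for k, v in extract_selectors(disasm).items()})
```

A production matcher has to tolerate the variations between solc versions (the `SHR`-based selector extraction above is the 0.8.x shape; older compilers use `DIV` by `2**224`), which is why the real pipeline matches at the LIR level rather than on raw opcodes.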
Output example:
Bytecode fragment:

```
60 80 60 40 52 34 80 15 61 00 0f 57 60 00 80 fd
```

Disassembly:

```
PUSH1 0x80
PUSH1 0x40
MSTORE
CALLVALUE
DUP1
ISZERO
PUSH2 0x000f
JUMPI
PUSH1 0x00
DUP1
REVERT
```

LIR (after CFG construction):

```
Block 0:
  MSTORE(0x40, 0x80)
  t1 = CALLVALUE
  t2 = ISZERO(t1)
  JUMPI(Block 2, Block 1, t2)
Block 1:
  REVERT(0x00, 0x00)
Block 2:
  [continues...]
```
Stage 2: HIR Transformation (LIR → HIR)
This stage does the heavy lifting: it assigns types, names storage variables, promotes patterns to high-level constructs, and converts the CFG to SSA form.
SSA Conversion
Static Single Assignment form rewrites every variable so that each is assigned exactly once. Where two control flow paths produce different values for the same variable, a φ (phi) node merges them. SSA enables the analysis passes in Stage 3 to reason about data flow without tracking aliasing.
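Concretely, a branch that assigns the same value along two paths looks like this after conversion. The notation follows the LIR/HIR examples elsewhere in this document; block names and variables are illustrative:

```
Block A:
  t_1 = CALLDATALOAD(0x04)
  JUMPI(Block B, Block C, cond)
Block B:
  t_2 = ADD(t_1, 1)
  JUMP(Block C)
Block C:                      // join point
  t_3 = φ(t_1, t_2)           // t_1 if A fell through, t_2 if B ran
  RETURN(t_3)
```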
Type Inference
Types are inferred from how values are used, not from annotations (which don’t exist in bytecode). The inference rules are straightforward:
| Usage Pattern | Inferred Type |
|---|---|
| Used as a `CALL` target or compared to `msg.sender` | `address` |
| Result of `EQ`, `LT`, `GT`, `ISZERO`; used as branch condition | `bool` |
| Passed to `KECCAK256`; used as event topic | `bytes32` |
| Used with signed arithmetic (`SDIV`, `SMOD`, `SLT`, `SGT`) | `int256` |
| Used with unsigned arithmetic; token balance patterns | `uint256` |
| Read/written byte-by-byte with masking | `bytes` variants |
Types propagate forward through data flow: if x is typed as address, then y = x inherits that type.
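The table rules plus forward propagation fit a standard fixed-point loop. This is a minimal sketch under assumed data structures (each SSA assignment as a `(dest, opcode, operands)` tuple, a synthetic `COPY` opcode for `y = x`); none of these names come from the pipeline itself:

```python
# Sketch of usage-based type inference over SSA assignments, iterated to a
# fixed point so that copies made before their source is typed still resolve.
# Rules mirror the table above; all names here are illustrative.

OPCODE_TYPES = {
    "EQ": "bool", "LT": "bool", "GT": "bool", "ISZERO": "bool",
    "SDIV": "int256", "SLT": "int256",
    "CALLER": "address", "ADD": "uint256", "SUB": "uint256",
}

def infer_types(assignments):
    types = {}
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for dest, op, operands in assignments:
            if op == "COPY":             # y = x inherits x's type
                t = types.get(operands[0])
            else:
                t = OPCODE_TYPES.get(op)
            if t and types.get(dest) != t:
                types[dest] = t
                changed = True
    return types

prog = [
    ("t0", "CALLER", []),                # address
    ("t1", "EQ", ["t0", "t9"]),          # bool
    ("t2", "COPY", ["t0"]),              # inherits address
]
print(infer_types(prog))
```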
Storage Analysis
Every SLOAD and SSTORE is catalogued by slot. Three patterns are common:
- Fixed slot: `SLOAD(0x00)` — a scalar state variable. Named by usage context (e.g., compared to `msg.sender` → `owner`).
- Mapping access: `SLOAD(KECCAK256(key, base_slot))` — recognized by the keccak pattern; the base slot identifies the mapping variable, the key is the lookup parameter.
- Array access: `SLOAD(ADD(base_slot, index))` — sequential slots starting from a base (the fixed-size array layout; dynamic arrays store elements at `keccak256(base_slot) + index`).
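The classification above can be sketched as pattern matching on the slot-address expression. Here expressions are nested tuples; this representation and the function name are assumptions for illustration, not the pipeline's actual structures:

```python
# Sketch of classifying an SLOAD address expression into the three common
# storage access patterns. Expressions are nested tuples such as
# ("KECCAK256", key, slot); the encoding is illustrative only.

def classify_slot_expr(expr):
    if isinstance(expr, int):
        return ("fixed", expr)                 # scalar state variable
    op = expr[0]
    if op == "KECCAK256":
        key, base = expr[1], expr[2]
        return ("mapping", base, key)          # mapping[key] at keccak(key, base)
    if op == "ADD" and isinstance(expr[1], int):
        return ("array", expr[1], expr[2])     # base_slot + index
    return ("unknown", expr)

print(classify_slot_expr(0x00))
print(classify_slot_expr(("KECCAK256", "caller", 0x01)))
print(classify_slot_expr(("ADD", 0x02, "i")))
```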
Reconstructed storage layout example:
```
Slot 0x00: written in constructor, read in modifier, type=address → "owner"
Slot 0x01 (mapping): keccak256(user_address, 0x01), type=uint256 → "mapping(address => uint256) balances"
Slot 0x02: sequential reads at 0x02 + offset, type=address[] → "address[] participants"
```
Semantic Lifting
Repeated opcode sequences that implement Solidity constructs are recognized and replaced with their high-level equivalent:
- `ISZERO` + `JUMPI` + `REVERT` with a string in returndata → `require(condition, "message")`
- `INVALID` after a condition check → `assert(condition)`
- `CALL` with a `value` parameter → candidate for `.transfer()` or `.call{value: ...}()`
- `LOG3` with a known topic hash → `emit Transfer(...)` (after event signature lookup)
OpenZeppelin’s standard contract patterns (Ownable, ReentrancyGuard, AccessControl, Pausable) all produce characteristic bytecode sequences that are matched at this stage and annotated.
Transformation example:
LIR:

```
t0 = CALLER
t1 = SLOAD(0x00)
t2 = EQ(t0, t1)
t3 = ISZERO(t2)
JUMPI(revert_block, continue_block, t3)
```

HIR:

```
owner: address = storage[0x00]   // named from usage
require(msg.sender == owner, "Not owner")
```
Stage 3: Optimization Passes
Eight passes run on the HIR to improve readability without changing semantics:
- Constant folding: evaluates expressions with all-constant operands. `keccak256("Transfer(address,address,uint256)")` → `0xddf252ad...`
- Constant propagation: substitutes known constants for variables.
- Dead code elimination: removes unreachable blocks and unused SSA definitions.
- Common subexpression elimination: deduplicates expressions computed more than once.
- Copy propagation: eliminates trivial copies (`x = y; z = x` → `z = y`).
- Variable renaming: replaces SSA temporaries with semantic names inferred from usage context and the function signature database.
- Control flow simplification: merges empty blocks, eliminates trivially-true conditions, converts dispatcher jump tables to switch-equivalent structure.
- Loop reconstruction: recognizes back-edges and classifies them as `for`, `while`, or `do-while` based on where the condition check appears relative to the loop body.
Before/after example:
Before:

```
v_0 = 100
v_1 = 200
v_2 = ADD(v_0, v_1)
v_3 = v_2
v_4 = ADD(v_2, 0)
if (false) { REVERT }
RETURN(v_3)
```

After:

```
RETURN(300)
```
Stage 4: Code Generation
The backend consumes optimized HIR and emits Solidity (or Yul for ambiguous cases).
Solidity backend generates a complete contract structure: state variable declarations ordered by slot, function signatures with inferred visibility and type annotations, events reconstructed from LOG operations, and require/revert statements with extracted error strings where present.
Example output for a simple withdrawal contract:
```solidity
// SPDX-License-Identifier: UNLICENSED
pragma solidity ^0.8.0;

contract Decompiled {
    address public owner;                          // Slot 0
    mapping(address => uint256) public balances;   // Slot 1

    constructor() {
        owner = msg.sender;
    }

    modifier onlyOwner() {
        require(msg.sender == owner, "Not owner");
        _;
    }

    function withdraw(uint256 amount) external {
        require(balances[msg.sender] >= amount, "Insufficient balance");
        balances[msg.sender] -= amount;
        payable(msg.sender).transfer(amount);
    }

    function setOwner(address newOwner) external onlyOwner {
        require(newOwner != address(0), "Invalid address");
        owner = newOwner;
    }
}
```
Yul backend is used when Solidity reconstruction is ambiguous—inline assembly blocks that use arbitrary control flow, optimized dispatcher patterns that don’t map cleanly to Solidity switch statements, or cases where the Solidity type system cannot express what the bytecode does. Yul preserves exact semantics at the cost of reduced readability.
Proxy Handling
Proxy contracts are worth special treatment because the interesting logic lives in the implementation, not the proxy itself. The pipeline detects proxy patterns by storage slot:
- EIP-1967 transparent/UUPS: implementation address at `0x360894a13ba1a3210667c828492db98dca3e2076cc3735a920a3ca505d382bbc`
- EIP-1167 minimal proxy: 45-byte clone bytecode with the implementation address embedded directly
- EIP-2535 diamond: facet registry accessed at `0xc8fcad8db84d3cc18b4c41d551ea0ee66dd599cde068d998e57d5e09332c131c`
When a proxy is detected, the pipeline fetches the implementation’s bytecode, decompiles it independently, and produces a unified view that shows the proxy’s storage layout alongside the implementation’s function logic. Delegatecall boundaries are annotated in the output.
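The EIP-1167 case is the simplest to detect, since the clone's runtime bytecode is fully determined: a fixed 10-byte prefix, the 20-byte implementation address, and a fixed 15-byte suffix. A minimal sketch (the function name is illustrative; the address used in the example is a placeholder):

```python
# Sketch of EIP-1167 minimal-proxy detection. The 45-byte runtime bytecode is
# a fixed prefix, a 20-byte implementation address, and a fixed suffix.

PREFIX = bytes.fromhex("363d3d373d3d3d363d73")
SUFFIX = bytes.fromhex("5af43d82803e903d91602b57fd5bf3")

def eip1167_implementation(runtime_code: bytes):
    """Return the embedded implementation address, or None if not a clone."""
    if len(runtime_code) != 45:
        return None
    if not (runtime_code.startswith(PREFIX) and runtime_code.endswith(SUFFIX)):
        return None
    return "0x" + runtime_code[10:30].hex()

# Placeholder implementation address, not a real deployment.
clone = PREFIX + bytes.fromhex("11" * 20) + SUFFIX
print(eip1167_implementation(clone))
```

EIP-1967 and EIP-2535 detection work differently: they require reading the well-known storage slots from chain state rather than inspecting the bytecode alone.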
Accuracy Benchmarks
The pipeline was evaluated against a corpus of approximately 10,000 contracts with verified source on Etherscan, where the original source serves as ground truth. Each metric below compares the decompiled output against the verified source, measured as of early 2026. Results vary with contract complexity, optimization settings, and proxy depth:
| Metric | Result |
|---|---|
| Functional equivalence (same behavior) | 98.7% |
| Control flow accuracy (CFG matches source-compiled CFG) | 99.2% |
| Type inference accuracy | 94.3% |
| Storage layout accuracy | 91.8% |
| Function signature match (known functions only) | 87.5% |
These numbers reflect contracts compiled with standard Solidity optimizer settings. Contracts with aggressive optimization (assembly-heavy, loop-unrolled) score lower on storage layout accuracy; contracts with heavy proxy nesting score lower on type accuracy.
Performance on a recent server-class machine:
| Contract size | Time | Memory |
|---|---|---|
| < 5 KB | < 0.5 s | ~10 MB |
| 5–20 KB | 0.5–2 s | ~50 MB |
| 20–50 KB | 2–5 s | ~150 MB |
| > 50 KB | 5–15 s | ~300 MB |
Known Limitations
Jump tables with computed targets: Some function dispatchers use computed jumps (JUMP where the target is derived from calldata) rather than linear selector comparisons. Static CFG construction cannot resolve these targets in all cases. Where symbolic execution is needed to resolve jump targets, the pipeline falls back to preserving the assembly block in Yul.
Compiler-version sensitivity: Solidity’s code generation has changed significantly between versions. Patterns that identify require() differ between solc 0.4.x and 0.8.x. The pipeline detects the compiler version from embedded bytecode metadata where available, but contracts compiled without metadata (or with metadata stripped) require heuristic version detection.
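Locating the metadata itself is mechanical: solc appends a CBOR blob to the runtime code and encodes its length, big-endian, in the final two bytes. A minimal sketch of the split (the function name is illustrative, and the CBOR bytes in the example are fabricated for shape, not real solc output; a robust implementation would also validate that the blob actually parses as CBOR, since a random trailer can satisfy the length check by accident):

```python
# Sketch of splitting solc's trailing metadata from deployed code. The last
# two bytes encode the CBOR blob's length big-endian; the blob sits just
# before them.

def split_metadata(code: bytes):
    """Return (runtime_code, cbor_metadata); metadata is empty if absent."""
    if len(code) < 2:
        return code, b""
    n = int.from_bytes(code[-2:], "big")
    if n == 0 or n + 2 > len(code):
        return code, b""                 # no plausible metadata trailer
    return code[: -(n + 2)], code[-(n + 2):-2]

body = bytes.fromhex("6080604052")
cbor = b"\xa1\x64solc\x43\x00\x08\x13"   # illustrative bytes, not real solc output
code = body + cbor + len(cbor).to_bytes(2, "big")
print(split_metadata(code))
```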
Optimizer-heavy output: Contracts compiled with aggressive optimization settings, particularly with inlining and common subexpression hoisting, produce bytecode that does not correspond to any natural Solidity structure. The decompiled output for these contracts is correct but less readable—more Yul-like even in the Solidity backend.