Intermediate Representations in Smart Contract Analysis: From Bytecode to Semantic Understanding

Raw EVM bytecode is a sequence of stack operations—efficient for execution, difficult for analysis. A flat stream of opcodes has no variables, no functions, no loops—just push, pop, and jump. Intermediate Representations (IRs) are the bridge between that execution-level form and the semantic understanding needed for security analysis.

The idea is to lift bytecode through a series of progressively more abstract representations, each one enabling different kinds of analysis that the previous level couldn’t support.

Why Multiple Levels?

Consider the difference between analyzing:

6080604052348015610010...

versus:

function withdraw(uint amount) external {
    require(balances[msg.sender] >= amount);
    msg.sender.call{value: amount}("");
    balances[msg.sender] -= amount;
}

The second form makes the reentrancy vulnerability obvious: the external call runs before the balance is decremented. The first form encodes the same behavior, but extracting it requires understanding the stack layout, the mapping access pattern, and the function dispatch structure.

IR lifting is how decompilers do that extraction.

A typical pipeline for EVM analysis lifts raw bytecode through four levels:

Raw Bytecode
    ↓  (disassembly + stack-to-variable)
LIR — Low-level IR: explicit stack operations, no structure
    ↓  (SSA conversion, control flow)
MIR — Medium-level IR: named variables, basic blocks, data flow
    ↓  (control flow structuring, type inference)
HIR — High-level IR: functions, loops, typed expressions
    ↓  (source reconstruction)
AST — Abstract Syntax Tree: source-like, query-ready

Each level adds abstraction and loses some low-level detail.

LIR: Low-Level IR

LIR converts bytecode into a form where stack operations are made explicit. Each stack slot becomes a named variable, and each instruction becomes a statement:

Bytecode:
60 04        PUSH1 0x04
35           CALLDATALOAD
60 00        PUSH1 0x00
54           SLOAD
10           LT
60 20        PUSH1 0x20
57           JUMPI

LIR:
block_0:
    v0 = PUSH 0x04
    v1 = CALLDATALOAD v0
    v2 = PUSH 0x00
    v3 = SLOAD v2
    v4 = LT v3 v1        // operands in pop order (top of stack first): v3 < v1
    v5 = PUSH 0x20
    JUMPI v5 v4 -> block_1, block_2
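The lifting step above can be sketched with a symbolic stack: each PUSH or value-producing opcode creates a fresh variable, and each consumer pops variable names instead of values. This is a minimal sketch, not a full lifter. The `STACK_EFFECT` table, the `(opcode, immediate)` tuple format, and the function name are assumptions made for illustration, and branch targets are omitted.

```python
# Hypothetical opcode table for this sketch: name -> (values popped, values pushed).
STACK_EFFECT = {
    "CALLDATALOAD": (1, 1),
    "SLOAD": (1, 1),
    "LT": (2, 1),
    "JUMPI": (2, 0),
}


def lift_to_lir(instructions):
    """Turn (opcode, immediate) pairs into LIR statements via a symbolic stack."""
    stack, statements, counter = [], [], 0
    for opcode, immediate in instructions:
        if opcode.startswith("PUSH"):
            name = f"v{counter}"
            counter += 1
            statements.append(f"{name} = PUSH {immediate:#04x}")
            stack.append(name)
            continue
        pops, pushes = STACK_EFFECT[opcode]
        # Operands are listed top-of-stack first, matching EVM pop order.
        operands = [stack.pop() for _ in range(pops)]
        if pushes:
            name = f"v{counter}"
            counter += 1
            statements.append(f"{name} = {opcode} {' '.join(operands)}")
            stack.append(name)
        else:
            statements.append(f"{opcode} {' '.join(operands)}")
    return statements


program = [
    ("PUSH1", 0x04), ("CALLDATALOAD", None),
    ("PUSH1", 0x00), ("SLOAD", None),
    ("LT", None),
    ("PUSH1", 0x20), ("JUMPI", None),
]
lir = lift_to_lir(program)
```

Because operands are recorded in pop order, the comparison comes out as `LT v3 v1`: the SLOAD result is on top of the stack when LT executes.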

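At this level, you can do gas cost calculation (every instruction has a known base gas cost, though opcodes such as SSTORE and CALL also carry dynamic costs that depend on runtime state), opcode sequence pattern matching, and basic coverage analysis. You can find the pattern SLOAD → CALL → SSTORE with a single linear scan, which is O(n) in the number of instructions and catches reentrancy candidates quickly. Confirming that the SLOAD and SSTORE touch the same slot generally needs data flow, so at this level the matches are candidates rather than findings.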
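The SLOAD → CALL → SSTORE scan can be sketched as one pass over the opcode names. This is a deliberately coarse heuristic under stated assumptions: the input is a flat list of opcode mnemonics for one function, control flow is ignored, and `find_reentrancy_candidates` is a hypothetical helper name.

```python
def find_reentrancy_candidates(opcodes):
    """Single pass: flag each SSTORE that follows a CALL that follows an SLOAD.

    Returns (call_index, sstore_index) pairs. Slots are not compared here,
    so every result is only a candidate for later confirmation.
    """
    seen_sload = False
    call_index = None
    candidates = []
    for index, opcode in enumerate(opcodes):
        if opcode == "SLOAD":
            seen_sload = True
        elif opcode == "CALL" and seen_sload and call_index is None:
            call_index = index  # first external call after a storage read
        elif opcode == "SSTORE" and call_index is not None:
            candidates.append((call_index, index))
    return candidates


risky = ["PUSH1", "SLOAD", "PUSH1", "CALL", "PUSH1", "SSTORE"]
safe = ["SLOAD", "SSTORE", "CALL"]
risky_hits = find_reentrancy_candidates(risky)
safe_hits = find_reentrancy_candidates(safe)
```

The scan visits each instruction once, so its cost grows linearly with code size.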

LIR is faithful to bytecode, which matters: some analysis needs to happen at this level precisely because it’s before any abstraction that might lose information.

MIR: Medium-Level IR with SSA

MIR eliminates the stack entirely, converting to Static Single Assignment (SSA) form where every variable is assigned exactly once. This may look like a small change, but it has large consequences for analysis.

LIR:
    v0 = PUSH 0x04
    v1 = CALLDATALOAD v0
    v2 = PUSH 0x00
    v3 = SLOAD v2
    v4 = LT v3 v1
    JUMPI block_1 v4

MIR:
    %0 = calldataload(4)
    %1 = sload(0)
    %2 = lt(%1, %0)
    br %2, @block_1, @block_2

SSA’s single-assignment property means that every use of a variable unambiguously points to its definition. Use-def chains become trivial to compute. When control flow merges, phi nodes make the join explicit:

@block_0:
    %a = 10
    br @block_2

@block_1:
    %b = 20
    br @block_2

@block_2:
    %c = phi(%a from @block_0, %b from @block_1)

SSA enables several important optimizations that also serve analysis:

Constant propagation, combined with constant folding, collapses chains of known-value operations, revealing the actual values flowing into security-sensitive operations:

Before:
    %0 = 100
    %1 = add(%0, 50)
    %2 = mul(%1, 2)

After:
    %2 = 300
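A minimal sketch of this pass over straight-line SSA code follows. The `(dest, op, args)` tuple encoding, the `const` pseudo-op, and the set of foldable operations are assumptions for illustration. Results are reduced modulo 2^256 because EVM arithmetic wraps.

```python
import operator

# Hypothetical subset of pure arithmetic ops that can be folded.
FOLDABLE = {"add": operator.add, "mul": operator.mul, "sub": operator.sub}
WORD = 2**256  # EVM words wrap modulo 2^256


def propagate_constants(instructions):
    """instructions: list of (dest, op, args); args are ints or SSA names.

    Returns (known constants, instructions that could not be folded).
    """
    known = {}
    residual = []
    for dest, op, args in instructions:
        # Substitute any argument whose value is already known.
        resolved = [known.get(a, a) if isinstance(a, str) else a for a in args]
        if op == "const":
            known[dest] = resolved[0]
        elif op in FOLDABLE and all(isinstance(a, int) for a in resolved):
            known[dest] = FOLDABLE[op](*resolved) % WORD
        else:
            residual.append((dest, op, resolved))
    return known, residual


code = [
    ("%0", "const", [100]),
    ("%1", "add", ["%0", 50]),
    ("%2", "mul", ["%1", 2]),
    ("%3", "sload", ["%2"]),
]
known, residual = propagate_constants(code)
```

A single forward pass is enough here because each SSA name has exactly one definition, and in straight-line code that definition precedes every use.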

Dead code elimination removes instructions whose results are never used, reducing noise in analysis:

Before:
    %0 = calldataload(4)
    %1 = sload(0)         // never used
    %2 = add(%0, 10)
    return %2

After:
    %0 = calldataload(4)
    %2 = add(%0, 10)
    return %2
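The elimination above can be sketched as a backward walk over one basic block. It keeps an instruction only if it has a side effect or defines a value that a kept instruction uses. The tuple encoding and the `SIDE_EFFECTS` set are assumptions for this sketch; a real pass would work across the whole CFG.

```python
# Hypothetical set of ops that must be kept even if their result is unused.
SIDE_EFFECTS = {"sstore", "call", "return", "br"}


def eliminate_dead_code(instructions):
    """instructions: list of (dest_or_None, op, args) for one basic block."""
    live = set()
    kept = []
    # Walk backwards so every use is seen before its definition.
    for dest, op, args in reversed(instructions):
        if op in SIDE_EFFECTS or dest in live:
            kept.append((dest, op, args))
            live.update(a for a in args if isinstance(a, str))
    kept.reverse()
    return kept


block = [
    ("%0", "calldataload", [4]),
    ("%1", "sload", [0]),          # never used
    ("%2", "add", ["%0", 10]),
    (None, "return", ["%2"]),
]
pruned = eliminate_dead_code(block)
```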

Most importantly for security analysis, MIR is where taint analysis works best. Taint analysis tracks how data flows from sources (user input, oracle prices) to sinks (ETH transfers, storage writes):

Taint Sources:
  %input = calldataload(*)   // TAINTED: user-controlled
  %sender = caller()         // TAINTED: user-controlled

Propagation:
  %tainted = add(%input, 5)  // TAINTED: input + constant
  %clean = sload(0)          // CLEAN: storage read
  %mixed = mul(%tainted, %clean) // TAINTED: any tainted operand propagates

Sinks:
  call(%addr, %tainted_value)  // ALERT: tainted value in call
  sstore(%slot, %tainted)      // ALERT: tainted value to storage
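The source, propagation, and sink rules above can be sketched as a forward pass over SSA instructions. This is an illustrative sketch that assumes straight-line code and a hypothetical `(dest, op, args)` encoding. Code with loops or phi nodes would need iteration to a fixed point, and there is no notion of sanitizers here.

```python
TAINT_SOURCES = {"calldataload", "caller"}  # user-controlled values
SINKS = {"call", "sstore"}                  # security-sensitive operations


def find_tainted_sinks(instructions):
    """Return (tainted SSA names, list of (op, args) alerts)."""
    tainted = set()
    alerts = []
    for dest, op, args in instructions:
        # Any tainted operand taints the whole operation.
        arg_tainted = any(a in tainted for a in args if isinstance(a, str))
        if op in SINKS and arg_tainted:
            alerts.append((op, args))
        if dest is not None and (op in TAINT_SOURCES or arg_tainted):
            tainted.add(dest)
    return tainted, alerts


mir = [
    ("%input", "calldataload", [4]),
    ("%tainted", "add", ["%input", 5]),
    ("%clean", "sload", [0]),
    ("%mixed", "mul", ["%tainted", "%clean"]),
    (None, "sstore", [1, "%mixed"]),   # tainted value reaches storage
    (None, "sstore", [2, "%clean"]),   # clean value, no alert
]
tainted, alerts = find_tainted_sinks(mir)
```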

This catches a different class of bugs from pattern matching: not “this opcode sequence matches a known exploit pattern” but “user-controlled data reaches a dangerous operation without sanitization.”

HIR: High-Level IR

HIR recovers structured control flow—loops, conditionals, and function boundaries—from the flat graph of basic blocks.

The key operation is recognizing structural patterns in the CFG:

If-then-else: a block with two successors that reconverge at a join point.

CFG: @A → @B (condition true)
     @A → @C (condition false)
     @B → @D
     @C → @D

HIR:
    if (condition) { /* block B */ }
    else { /* block C */ }
    // block D continues

Loops: a back edge (an edge whose target dominates its source, which usually shows up as a jump from a later block back to an earlier one) indicates a loop.

CFG: @header → @body (continue)
     @header → @exit (done)
     @body → @header (back edge)

HIR:
    while (condition) { /* body */ }
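One way to find back edges is a depth-first search that reports any edge pointing at a block still on the DFS stack. This is a sketch, not a full structuring algorithm. The adjacency-dict CFG encoding is an assumption, and the DFS result matches the dominator-based definition only on reducible control flow graphs.

```python
def find_back_edges(cfg, entry):
    """cfg: dict mapping block name -> list of successor block names."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {block: WHITE for block in cfg}
    back_edges = []

    def visit(block):
        color[block] = GREY
        for succ in cfg[block]:
            if color[succ] == GREY:
                # The successor is an ancestor on the current path: a loop.
                back_edges.append((block, succ))
            elif color[succ] == WHITE:
                visit(succ)
        color[block] = BLACK

    visit(entry)
    return back_edges


loop_cfg = {"header": ["body", "exit"], "body": ["header"], "exit": []}
diamond_cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
loop_edges = find_back_edges(loop_cfg, "header")
diamond_edges = find_back_edges(diamond_cfg, "A")
```

The diamond-shaped CFG of an if-then-else has no back edge, which is what distinguishes it from a loop at this stage.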

HIR is also where type recovery happens. EVM bytecode has no type annotations, but type information is implicit in how values are used. The AND with 0xffffffffffffffffffffffffffffffffffffffff is a mask that extracts an address:

MIR:
    %0 = calldataload(4)
    %1 = and(%0, 0xffffffffffffffffffffffffffffffffffffffff)

Type inference: AND with address mask → type is address

HIR:
    address arg0 = address(calldataload(4))
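The mask-based inference can be sketched as a lookup from known masks to candidate types. Only the 160-bit address mask comes from the text above. The `uint8` entry, the tuple encoding, and the function name are assumptions added for illustration, and a real tool would treat these results as hints rather than proof.

```python
# Mask value -> inferred type. Only the address mask is taken from the text;
# the uint8 entry is an illustrative assumption.
MASK_TYPES = {
    (1 << 160) - 1: "address",
    (1 << 8) - 1: "uint8",
}


def infer_types(instructions):
    """instructions: list of (dest, op, args); returns {ssa_name: type_hint}."""
    types = {}
    for dest, op, args in instructions:
        if op != "and":
            continue
        for arg in args:
            # A constant operand that matches a known mask gives a type hint.
            if isinstance(arg, int) and arg in MASK_TYPES:
                types[dest] = MASK_TYPES[arg]
    return types


mir = [
    ("%0", "calldataload", [4]),
    ("%1", "and", ["%0", (1 << 160) - 1]),
]
type_hints = infer_types(mir)
```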

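Mapping access is recognizable by the keccak256(key, slot) pattern. For value-type keys, Solidity pads the key and the mapping's base slot to 32 bytes each, concatenates them, and hashes the result to get the storage location: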

MIR:
    %key = calldataload(4)
    %slot = keccak256(%key, 0)
    %value = sload(%slot)

HIR:
    mapping_0[key]
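The recognition step can be sketched by remembering which SSA names hold a keccak256 over a key and a constant slot, then rewriting any sload of such a name. The tuple encoding and the `mapping_N[key]` naming are assumptions that mirror the listing above. Nested mappings and dynamic arrays are out of scope for this sketch.

```python
def recover_mapping_reads(instructions):
    """Return {ssa_name: recovered expression} for sload(keccak256(key, slot))."""
    hashes = {}  # SSA name -> (key, base slot)
    reads = {}
    for dest, op, args in instructions:
        if op == "keccak256" and len(args) == 2 and isinstance(args[1], int):
            hashes[dest] = (args[0], args[1])
        elif op == "sload" and args[0] in hashes:
            key, base = hashes[args[0]]
            # Drop the SSA sigil so the output reads like source code.
            reads[dest] = f"mapping_{base}[{str(key).lstrip('%')}]"
    return reads


mir = [
    ("%key", "calldataload", [4]),
    ("%slot", "keccak256", ["%key", 0]),
    ("%value", "sload", ["%slot"]),
]
mapping_reads = recover_mapping_reads(mir)
```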

Sequential storage slot access suggests a struct:

MIR:
    %field0 = sload(5)
    %field1 = sload(6)
    %field2 = sload(7)

HIR:
    struct_5.field0, struct_5.field1, struct_5.field2

At the HIR level, high-level pattern matching becomes possible—matching against structured control flow rather than opcode sequences.

AST: Abstract Syntax Tree

The AST represents code as a tree of syntactic constructs. For a simple token transfer:

FunctionDeclaration
├── name: "transfer"
├── params: [("to", Address), ("amount", Uint256)]
├── body: Block
│   ├── RequireStatement
│   │   └── BinaryOp(>=)
│   │       ├── MappingAccess(balances, MsgSender)
│   │       └── Identifier("amount")
│   ├── AssignmentStatement(-=)
│   │   ├── MappingAccess(balances, MsgSender)
│   │   └── Identifier("amount")
│   ├── AssignmentStatement(+=)
│   │   ├── MappingAccess(balances, to)
│   │   └── Identifier("amount")
│   └── ReturnStatement(true)
└── returns: Bool

The AST is where semantic queries become natural:

Reentrancy pattern query:
  MATCH (block:Block)
  WHERE block.contains(StateRead(loc))
    AND block.contains(ExternalCall())
    AND block.contains(StateWrite(loc))
    AND position(ExternalCall) < position(StateWrite)
  RETURN block as vulnerable

Semantic query:
  "Find all functions that can transfer ETH"
  MATCH (f:Function)
  WHERE f.body.descendants().any(
    node => node.type == 'Call' AND node.value > 0
  )
  RETURN f
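The reentrancy query above can be approximated in a few lines once a block is flattened into an ordered list of effects. This sketch assumes each statement has already been summarized as a hypothetical `(kind, location)` tuple. It is slightly stricter than the query in one respect, since it requires the state read to precede the write.

```python
def block_is_vulnerable(statements):
    """statements: ordered (kind, location) tuples for one block.

    Flags a write to previously read state that happens after an external call.
    """
    read_locations = set()
    call_seen = False
    for kind, location in statements:
        if kind == "StateRead":
            read_locations.add(location)
        elif kind == "ExternalCall":
            call_seen = True
        elif kind == "StateWrite" and call_seen and location in read_locations:
            return True
    return False


withdraw = [
    ("StateRead", "balances[msg.sender]"),
    ("ExternalCall", None),
    ("StateWrite", "balances[msg.sender]"),
]
withdraw_fixed = [
    ("StateRead", "balances[msg.sender]"),
    ("StateWrite", "balances[msg.sender]"),
    ("ExternalCall", None),
]
vulnerable = block_is_vulnerable(withdraw)
fixed_ok = not block_is_vulnerable(withdraw_fixed)
```

Moving the state write ahead of the external call, as in `withdraw_fixed`, is enough to stop the check from firing.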

AST queries express intent in terms of what the code does, not how it does it. The representation also produces human-readable output, which is necessary for any tool that presents findings to a human reviewer.

Matching Analysis to IR Level

Different analyses belong at different IR levels, and running them at the right level is important for both accuracy and performance:

Analysis                         Best IR Level  Reason
Gas profiling                    LIR            Needs instruction-level fidelity
Opcode sequence patterns         LIR            Works before structural recovery
Data flow / taint tracking       MIR            SSA makes def-use chains trivial
Constant propagation             MIR            Simplest and cheapest on SSA form
Reentrancy detection (full)      All levels     LIR for candidates, MIR for confirmation, HIR for context, AST for reporting
Type recovery                    HIR            Needs control flow structure
Storage layout                   HIR            Needs variable resolution
Semantic vulnerability patterns  AST            Needs full structural understanding
Source reconstruction            AST            The AST is already source-like

Cross-level analysis often produces the best results. Reentrancy detection, for example, works well as a pipeline: LIR scan to find CALL opcodes with nearby SSTORE/SLOAD instructions; MIR data flow to confirm that the SLOAD and SSTORE reference the same logical storage slot; HIR to identify function boundaries and understand the call context; AST to generate a human-readable report with suggested remediation.

Handling Ambiguity

Bytecode often has multiple valid interpretations, and the IR must either resolve them or represent the ambiguity explicitly:

Bytecode pattern:
    PUSH1 0x01
    PUSH1 0x00
    SSTORE

Could mean:
    storage[0] = 1      // simple assignment
    flag = true         // boolean flag
    status = ACTIVE     // enum value

Resolution: use context (how is slot 0 used elsewhere?),
            type propagation (what types flow to this slot?),
            or remain ambiguous (report all interpretations)

The alternative to handling ambiguity well is producing either false positives (flagging safe code) or false negatives (missing real vulnerabilities). Getting IR transformations right is most of the work in building a decompiler that produces useful security findings rather than noise.