Intermediate Representations in Smart Contract Analysis: From Bytecode to Semantic Understanding
Raw EVM bytecode is a sequence of stack operations—efficient for execution, difficult for analysis. A flat stream of opcodes has no variables, no functions, no loops—just push, pop, and jump. Intermediate Representations (IRs) are the bridge between that execution-level form and the semantic understanding needed for security analysis.
The idea is to lift bytecode through a series of progressively more abstract representations, each one enabling different kinds of analysis that the previous level couldn’t support.
Why Multiple Levels?
Consider the difference between analyzing:
6080604052348015610010...
versus:
function withdraw(uint amount) external {
require(balances[msg.sender] >= amount);
msg.sender.call{value: amount}("");
balances[msg.sender] -= amount;
}
The second form makes the reentrancy vulnerability obvious: the external call happens before the balance is updated. The first form encodes the same behavior, but extracting it requires understanding the stack layout, the mapping access pattern, and the function dispatch structure.
IR lifting is how decompilers do that extraction.
The typical pipeline for EVM analysis has four levels:
Raw Bytecode
↓ (disassembly + stack-to-variable)
LIR — Low-level IR: explicit stack operations, no structure
↓ (SSA conversion, control flow)
MIR — Medium-level IR: named variables, basic blocks, data flow
↓ (control flow structuring, type inference)
HIR — High-level IR: functions, loops, typed expressions
↓ (source reconstruction)
AST — Abstract Syntax Tree: source-like, query-ready
Each level adds abstraction and loses some low-level detail.
LIR: Low-Level IR
LIR converts bytecode into a form where stack operations are made explicit. Each stack slot becomes a named variable, and each instruction becomes a statement:
Bytecode:
60 04 PUSH1 0x04
35 CALLDATALOAD
60 00 PUSH1 0x00
54 SLOAD
10 LT
60 20 PUSH1 0x20
57 JUMPI
LIR:
block_0:
v0 = PUSH 0x04
v1 = CALLDATALOAD v0
v2 = PUSH 0x00
v3 = SLOAD v2
v4 = LT v3 v1 // operands in pop order: top of stack (v3) is the left operand
v5 = PUSH 0x20
JUMPI v5 v4 -> block_1, block_2
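The stack-to-variable step can be sketched with a symbolic stack that holds variable names instead of runtime values. This is a minimal illustration, not a full lifter: the `STACK_EFFECT` table and the `lift_to_lir` function are hypothetical names introduced here, only the opcodes from the example are covered, and jump targets are not resolved into block names.

```python
# Hypothetical opcode table for this sketch: name -> (items popped, items pushed).
STACK_EFFECT = {
    "PUSH1": (0, 1),
    "CALLDATALOAD": (1, 1),
    "SLOAD": (1, 1),
    "LT": (2, 1),
    "JUMPI": (2, 0),
}


def lift_to_lir(instructions):
    """Turn (opcode, immediate) pairs into LIR statements via a symbolic stack."""
    stack = []       # holds variable names, not runtime values
    statements = []
    counter = 0
    for opcode, immediate in instructions:
        pops, pushes = STACK_EFFECT[opcode]
        # Operands are listed in pop order, so the top of the stack comes first.
        operands = [stack.pop() for _ in range(pops)]
        if immediate is not None:
            operands.append(f"0x{immediate:02x}")
        # Drop the width suffix so PUSH1 is printed as PUSH.
        text = " ".join([opcode.rstrip("0123456789"), *operands])
        if pushes:
            name = f"v{counter}"
            counter += 1
            stack.append(name)
            text = f"{name} = {text}"
        statements.append(text)
    return statements


program = [("PUSH1", 0x04), ("CALLDATALOAD", None), ("PUSH1", 0x00),
           ("SLOAD", None), ("LT", None), ("PUSH1", 0x20), ("JUMPI", None)]
for line in lift_to_lir(program):
    print(line)
```

Because operands are emitted in pop order, the comparison comes out as `LT v3 v1`: the EVM's LT tests whether the top stack item is less than the item beneath it.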
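At this level, you can do gas cost estimation (every instruction has a known base cost, although SLOAD, SSTORE, CALL, and memory expansion add state-dependent components), opcode sequence pattern matching, and basic coverage analysis. You can find the pattern SLOAD → CALL → SSTORE with a single linear pass over each block, which is O(n) in the number of instructions. The scan is fast but imprecise: without data flow, “the same slot” can only be checked when the slot is a literal constant, so the results are reentrancy candidates, not confirmed findings.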
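A candidate scan of this kind might look like the following sketch. It assumes a simplified input, introduced here for illustration: each block is a list of `(opcode, slot)` pairs in which `slot` is an integer when the storage slot is a literal constant and `None` otherwise. The function name `find_reentrancy_candidates` is hypothetical.

```python
def find_reentrancy_candidates(block):
    """One pass over a basic block of (opcode, constant_slot_or_None) pairs.

    Reports every constant slot that is read, then followed by a CALL,
    then written. Non-constant slots are ignored, which is the imprecision
    that later IR levels have to repair.
    """
    read_slots = set()          # slots read so far
    read_before_call = set()    # slots read before at least one CALL
    candidates = []
    for index, (opcode, slot) in enumerate(block):
        if opcode == "SLOAD" and slot is not None:
            read_slots.add(slot)
        elif opcode == "CALL":
            read_before_call |= read_slots
        elif opcode == "SSTORE" and slot in read_before_call:
            candidates.append((slot, index))
    return candidates


vulnerable = [("SLOAD", 0), ("CALL", None), ("SSTORE", 0)]
safe = [("SLOAD", 0), ("SSTORE", 0), ("CALL", None)]
print(find_reentrancy_candidates(vulnerable))  # [(0, 2)]
print(find_reentrancy_candidates(safe))        # []
```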
LIR is faithful to bytecode, which matters: some analysis needs to happen at this level precisely because it’s before any abstraction that might lose information.
MIR: Medium-Level IR with SSA
MIR eliminates the stack entirely: constant pushes are folded into the instructions that use them, and the code is converted to Static Single Assignment (SSA) form, where every variable is assigned exactly once. This may look like a small change, but it has large consequences for analysis.
LIR:
v0 = PUSH 0x04
v1 = CALLDATALOAD v0
v2 = PUSH 0x00
v3 = SLOAD v2
v4 = LT v3 v1
v5 = PUSH 0x20
JUMPI v5 v4 -> block_1, block_2
MIR:
%0 = calldataload(4)
%1 = sload(0)
%2 = lt(%1, %0)
br %2, @block_1, @block_2
SSA’s single-assignment property means that every use of a variable unambiguously points to its definition. Use-def chains become trivial to compute. When control flow merges, phi nodes make the join explicit:
@block_0:
%a = 10
br @block_2
@block_1:
%b = 20
br @block_2
@block_2:
%c = phi(%a from @block_0, %b from @block_1)
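Because each SSA name has exactly one definition, a use-def lookup reduces to a dictionary. The sketch below assumes instructions are given as text lines in the `%name = op(args)` shape used in this article; `build_use_def` is a hypothetical helper, not part of any particular tool.

```python
import re


def build_use_def(instructions):
    """Map each SSA name to its defining expression and to the names it uses."""
    definitions = {}
    uses = {}
    for text in instructions:
        target, expression = [part.strip() for part in text.split("=", 1)]
        if target in definitions:
            raise ValueError(f"{target} assigned twice, input is not in SSA form")
        definitions[target] = expression
        uses[target] = re.findall(r"%\w+", expression)
    return definitions, uses


definitions, uses = build_use_def([
    "%0 = calldataload(4)",
    "%1 = add(%0, 10)",
    "%2 = mul(%1, %0)",
])
print(uses["%2"])         # ['%1', '%0']
print(definitions["%1"])  # add(%0, 10)
```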
SSA enables several important optimizations that also serve analysis:
Constant propagation collapses chains of known-value operations, revealing the actual values flowing into security-sensitive operations:
Before:
%0 = 100
%1 = add(%0, 50)
%2 = mul(%1, 2)
After:
%2 = 300
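A minimal sketch of this folding step, assuming a straight-line block in SSA form so that every definition precedes its uses and one forward pass is enough. The `(name, op, args)` tuple encoding, the `const` pseudo-op, and the `FOLDERS` table are illustrative choices made here; arithmetic wraps at 256 bits as it does in the EVM.

```python
WORD = 2 ** 256  # EVM arithmetic wraps at 256 bits

# Hypothetical table of foldable operations for this sketch.
FOLDERS = {
    "add": lambda a, b: (a + b) % WORD,
    "mul": lambda a, b: (a * b) % WORD,
}


def propagate_constants(instructions):
    """instructions: list of (name, op, args). Returns {name: constant value}."""
    known = {}
    for name, op, args in instructions:
        if op == "const":
            known[name] = args[0]
            continue
        # Replace operand names by their values where those are already known.
        values = [known.get(a, a) if isinstance(a, str) else a for a in args]
        if op in FOLDERS and all(isinstance(v, int) for v in values):
            known[name] = FOLDERS[op](*values)
    return known


block = [
    ("%0", "const", [100]),
    ("%1", "add", ["%0", 50]),
    ("%2", "mul", ["%1", 2]),
]
print(propagate_constants(block)["%2"])  # 300
```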
Dead code elimination removes side-effect-free instructions whose results are never used, reducing noise in analysis:
Before:
%0 = calldataload(4)
%1 = sload(0) // never used
%2 = add(%0, 10)
return %2
After:
%0 = calldataload(4)
%2 = add(%0, 10)
return %2
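One way to sketch this is a single backward pass over a straight-line block: an instruction is kept if it has a side effect or if a kept instruction uses its result. The `(name, op, args)` encoding and the `SIDE_EFFECTS` set are assumptions made for this example, and the set is deliberately incomplete.

```python
# Assumed set of operations that must never be removed (incomplete on purpose).
SIDE_EFFECTS = {"sstore", "call", "return"}


def eliminate_dead_code(instructions):
    """instructions: list of (name_or_None, op, args) for one straight-line block."""
    live = set()   # names whose values are needed by a kept instruction
    kept = []
    for name, op, args in reversed(instructions):
        if op in SIDE_EFFECTS or name in live:
            kept.append((name, op, args))
            live.update(a for a in args if isinstance(a, str))
    kept.reverse()
    return kept


block = [
    ("%0", "calldataload", [4]),
    ("%1", "sload", [0]),          # never used
    ("%2", "add", ["%0", 10]),
    (None, "return", ["%2"]),
]
for name, op, args in eliminate_dead_code(block):
    print(name, op, args)
```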
Most importantly for security analysis, MIR is where taint analysis works best. Taint analysis tracks how data flows from sources (user input, oracle prices) to sinks (ETH transfers, storage writes):
Taint Sources:
%input = calldataload(*) // TAINTED: user-controlled
%sender = caller() // TAINTED: user-controlled
Propagation:
%tainted = add(%input, 5) // TAINTED: input + constant
%clean = sload(0) // CLEAN: storage read
%mixed = mul(%tainted, %clean) // TAINTED: any tainted operand propagates
Sinks:
call(%addr, %tainted_value) // ALERT: tainted value in call
sstore(%slot, %tainted) // ALERT: tainted value to storage
This catches a different class of bugs from pattern matching: not “this opcode sequence matches a known exploit pattern” but “user-controlled data reaches a dangerous operation without sanitization.”
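The propagation rule above (any tainted operand taints the result) can be sketched as one forward pass over SSA instructions. The `SOURCES` and `SINKS` sets and the `(name, op, args)` encoding are assumptions for this illustration, and sanitization is not modeled at all.

```python
SOURCES = {"calldataload", "caller"}   # assumed user-controlled inputs
SINKS = {"call", "sstore"}             # assumed security-sensitive operations


def find_tainted_sinks(instructions):
    """Return (tainted names, alerts) for a straight-line list of (name, op, args)."""
    tainted = set()
    alerts = []
    for name, op, args in instructions:
        uses_taint = any(a in tainted for a in args if isinstance(a, str))
        if op in SINKS:
            if uses_taint:
                alerts.append((op, args))
        elif op in SOURCES or uses_taint:
            tainted.add(name)
    return tainted, alerts


program = [
    ("%input", "calldataload", [4]),
    ("%tainted", "add", ["%input", 5]),
    ("%clean", "sload", [0]),
    ("%mixed", "mul", ["%tainted", "%clean"]),
    (None, "sstore", [0, "%mixed"]),   # tainted value reaches storage
    (None, "sstore", [1, "%clean"]),   # clean value, no alert
]
tainted, alerts = find_tainted_sinks(program)
print(sorted(tainted))  # ['%input', '%mixed', '%tainted']
print(alerts)           # [('sstore', [0, '%mixed'])]
```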
HIR: High-Level IR
HIR recovers structured control flow—loops, conditionals, and function boundaries—from the flat graph of basic blocks.
The key operation is recognizing structural patterns in the CFG:
If-then-else: a block with two successors that reconverge at a join point.
CFG: @A → @B (condition true)
@A → @C (condition false)
@B → @D
@C → @D
HIR:
if (condition) { /* block B */ }
else { /* block C */ }
// block D continues
Loops: a back edge (an edge whose target dominates its source, typically a jump from the end of the loop body back to the loop header) indicates a loop.
CFG: @header → @body (continue)
@header → @exit (done)
@body → @header (back edge)
HIR:
while (condition) { /* body */ }
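Both patterns can be detected on a CFG stored as an adjacency map. The sketch below is one possible approach under simplifying assumptions: every block appears as a key, dominators are computed with the plain iterative set algorithm, and only the simplest diamond shape is matched. The function names are hypothetical.

```python
def dominators(cfg, entry):
    """cfg: {block: [successors]}. Returns {block: set of blocks dominating it}."""
    blocks = set(cfg)
    preds = {b: set() for b in blocks}
    for block, successors in cfg.items():
        for successor in successors:
            preds[successor].add(block)
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:                      # iterate until a fixed point is reached
        changed = False
        for b in blocks - {entry}:
            incoming = [dom[p] for p in preds[b]]
            new = {b} | (set.intersection(*incoming) if incoming else set())
            if new != dom[b]:
                dom[b], changed = new, True
    return dom


def back_edges(cfg, entry):
    """Edges whose target dominates their source; each one marks a loop."""
    dom = dominators(cfg, entry)
    return [(src, dst) for src, succs in cfg.items() for dst in succs if dst in dom[src]]


def diamonds(cfg):
    """(head, then, else, join) for two-way branches whose arms rejoin immediately."""
    found = []
    for head, succs in cfg.items():
        if len(succs) == 2:
            then_block, else_block = succs
            if len(cfg[then_block]) == 1 and cfg[then_block] == cfg[else_block]:
                found.append((head, then_block, else_block, cfg[then_block][0]))
    return found


loop_cfg = {"header": ["body", "exit"], "body": ["header"], "exit": []}
if_cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(back_edges(loop_cfg, "header"))  # [('body', 'header')]
print(diamonds(if_cfg))                # [('A', 'B', 'C', 'D')]
```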
HIR is also where type recovery happens. EVM bytecode has no type annotations, but type information is implicit in how values are used. The AND with 0xffffffffffffffffffffffffffffffffffffffff is a mask that extracts an address:
MIR:
%0 = calldataload(4)
%1 = and(%0, 0xffffffffffffffffffffffffffffffffffffffff)
Type inference: AND with address mask → type is address
HIR:
address arg0 = address(calldataload(4))
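A mask-based inference rule could be sketched as follows. Only the address mask comes from the text above; the `uint8` rule and the `uint256` fallback are additional heuristics assumed for this example, and `infer_type` is a hypothetical name.

```python
ADDRESS_MASK = (1 << 160) - 1   # twenty 0xff bytes, the width of an address


def infer_type(op, args):
    """Guess a type for the result of one MIR operation from its mask operand."""
    if op == "and" and ADDRESS_MASK in args:
        return "address"
    if op == "and" and 0xFF in args:
        return "uint8"          # assumed heuristic: single-byte mask
    return "uint256"            # default: an untyped EVM word


print(infer_type("and", ["%0", ADDRESS_MASK]))  # address
print(infer_type("add", ["%0", 1]))             # uint256
```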
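Mapping access is recognizable by the keccak256(key, slot) pattern: Solidity stores the value for a key at the hash of the 32-byte key concatenated with the 32-byte slot number of the mapping.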
MIR:
%key = calldataload(4)
%slot = keccak256(%key, 0)
%value = sload(%slot)
HIR:
mapping_0[key]
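Recognizing this pattern on MIR amounts to remembering which names hold a `keccak256(key, constant_slot)` result and rewriting any `sload` of such a name. This is a sketch that keeps the abstract two-argument `keccak256` used in this article; `recover_mapping_reads` and the tuple encoding are assumptions made here.

```python
def recover_mapping_reads(instructions):
    """instructions: list of (name, op, args). Returns {name: 'mapping_<slot>[<key>]'}."""
    hashes = {}   # hash result name -> (key operand, constant base slot)
    reads = {}
    for name, op, args in instructions:
        if op == "keccak256" and len(args) == 2 and isinstance(args[1], int):
            hashes[name] = (args[0], args[1])
        elif op == "sload" and args[0] in hashes:
            key, slot = hashes[args[0]]
            reads[name] = f"mapping_{slot}[{key}]"
    return reads


mir = [
    ("%key", "calldataload", [4]),
    ("%slot", "keccak256", ["%key", 0]),
    ("%value", "sload", ["%slot"]),
]
print(recover_mapping_reads(mir))  # {'%value': 'mapping_0[%key]'}
```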
Accesses to consecutive storage slots suggest a struct, although adjacent independent state variables produce the same pattern:
MIR:
%field0 = sload(5)
%field1 = sload(6)
%field2 = sload(7)
HIR:
struct_5.field0, struct_5.field1, struct_5.field2
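The grouping step can be sketched as finding runs of consecutive constant slots. This is only a heuristic under the assumptions stated here: slots are literal integers, and a run of length one is not treated as struct-like. `group_consecutive_slots` is a hypothetical helper.

```python
def group_consecutive_slots(slots):
    """Group constant storage slots into runs of consecutive numbers."""
    groups = []
    for slot in sorted(set(slots)):
        if groups and slot == groups[-1][-1] + 1:
            groups[-1].append(slot)     # extends the current run
        else:
            groups.append([slot])       # starts a new run
    # A lone slot carries no struct evidence, so only longer runs are kept.
    return [group for group in groups if len(group) > 1]


print(group_consecutive_slots([7, 5, 0, 6]))  # [[5, 6, 7]]
```

Each returned run is a candidate, named after its first slot as in `struct_5` above.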
At the HIR level, high-level pattern matching becomes possible—matching against structured control flow rather than opcode sequences.
AST: Abstract Syntax Tree
The AST represents code as a tree of syntactic constructs. For a simple token transfer:
FunctionDeclaration
├── name: "transfer"
├── params: [("to", Address), ("amount", Uint256)]
├── body: Block
│ ├── RequireStatement
│ │ └── BinaryOp(>=)
│ │ ├── MappingAccess(balances, MsgSender)
│ │ └── Identifier("amount")
│ ├── AssignmentStatement(-=)
│ │ ├── MappingAccess(balances, MsgSender)
│ │ └── Identifier("amount")
│ ├── AssignmentStatement(+=)
│ │ ├── MappingAccess(balances, to)
│ │ └── Identifier("amount")
│ └── ReturnStatement(true)
└── returns: Bool
The AST is where semantic queries become natural:
Reentrancy pattern query:
MATCH (block:Block)
WHERE block.contains(StateRead(loc))
AND block.contains(ExternalCall())
AND block.contains(StateWrite(loc))
AND position(ExternalCall) < position(StateWrite)
RETURN block as vulnerable
Semantic query:
"Find all functions that can transfer ETH"
MATCH (f:Function)
WHERE f.body.descendants().any(
node => node.type == 'Call' AND node.value > 0
)
RETURN f
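The ETH-transfer query can be approximated in a few lines over a toy tree. The `Node` class and its fields are assumptions made for this sketch, not the schema of any particular tool, and `value` is treated as statically known, which real analyses often cannot assume.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str
    name: str = ""
    value: int = 0          # wei attached to a Call node, when statically known
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node below `node`, depth first."""
    for child in node.children:
        yield child
        yield from descendants(child)


def functions_that_transfer_eth(root):
    """Names of Function nodes containing a Call that sends a non-zero value."""
    return [
        node.name
        for node in descendants(root)
        if node.type == "Function"
        and any(d.type == "Call" and d.value > 0 for d in descendants(node))
    ]


contract = Node("Contract", children=[
    Node("Function", name="withdraw", children=[
        Node("Block", children=[Node("Call", value=1)]),
    ]),
    Node("Function", name="balanceOf", children=[
        Node("Block", children=[Node("Return")]),
    ]),
])
print(functions_that_transfer_eth(contract))  # ['withdraw']
```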
AST queries express intent in terms of what the code does, not how it does it. The representation also produces human-readable output, which is necessary for any tool that presents findings to a human reviewer.
Matching Analysis to IR Level
Different analyses belong at different IR levels, and running them at the right level is important for both accuracy and performance:
| Analysis | Best IR Level | Reason |
|---|---|---|
| Gas profiling | LIR | Needs instruction-level fidelity |
| Opcode sequence patterns | LIR | Works before structural recovery |
| Data flow / taint tracking | MIR | SSA makes def-use chains trivial |
| Constant propagation | MIR | SSA makes it a simple sparse pass |
| Reentrancy detection (full) | All levels | LIR for candidates, MIR for confirmation, HIR for context |
| Type recovery | HIR | Needs control flow structure |
| Storage layout | HIR | Needs variable resolution |
| Semantic vulnerability patterns | AST | Needs full structural understanding |
| Source reconstruction | AST | The AST is already source-shaped |
Cross-level analysis often produces the best results. Reentrancy detection, for example, works well as a pipeline: LIR scan to find CALL opcodes with nearby SSTORE/SLOAD instructions; MIR data flow to confirm that the SLOAD and SSTORE reference the same logical storage slot; HIR to identify function boundaries and understand the call context; AST to generate a human-readable report with suggested remediation.
Handling Ambiguity
Bytecode often has multiple valid interpretations, and the IR must either resolve them or represent the ambiguity explicitly:
Bytecode pattern:
PUSH1 0x01
PUSH1 0x00
SSTORE
Could mean:
storage[0] = 1 // simple assignment
flag = true // boolean flag
status = ACTIVE // enum value
Resolution: use context (how is slot 0 used elsewhere?),
type propagation (what types flow to this slot?),
or remain ambiguous (report all interpretations)
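Representing the ambiguity explicitly can be as simple as returning every interpretation that the observed evidence does not rule out. The rules below are illustrative assumptions, not an established algorithm: a slot that only ever receives 0 or 1 may be a boolean, and a slot that only receives values below 256 may be an enum.

```python
def candidate_types(values_written):
    """values_written: every constant observed being stored to one slot."""
    observed = set(values_written)
    candidates = ["uint256"]                 # always a valid reading of a word
    if observed and observed <= {0, 1}:
        candidates.append("bool")
    if observed and max(observed) < 256:
        candidates.append("enum")            # assumed rule: enum members fit in a byte
    return candidates


print(candidate_types([1]))        # ['uint256', 'bool', 'enum']
print(candidate_types([0, 1, 2]))  # ['uint256', 'enum']
print(candidate_types([1000]))     # ['uint256']
```

With only the single store from the example, all three readings survive; seeing other writes to the same slot elsewhere in the contract is what narrows the list.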
Handling ambiguity poorly produces either false positives (flagging safe code) or false negatives (missing real vulnerabilities). Getting IR transformations right is most of the work in building a decompiler that produces useful security findings rather than noise.