Semantic Lifting
Beyond Opcode Translation
How Sigvex reconstructs high-level semantics from raw bytecode, recovering types, structures, and patterns that compilation erased—and why this foundation is essential for accurate vulnerability detection.
Decompilation is more than reversing opcodes. To analyze a contract meaningfully, you need to understand what the code is doing, not just which instructions it contains. That means recovering the types, structures, and patterns that compilation erased. This article explains how semantic lifting works, why each component is necessary, and how it connects to the accuracy of vulnerability detection downstream.
The Problem with Opcode-Level Analysis
Consider what happens when a simple Solidity mapping is compiled to EVM bytecode. A declaration like mapping(address => uint256) public balances represents a clear concept: associate each address with a balance value. In the compiled bytecode, this becomes a sequence of stack manipulations, memory operations, and hash computations. The PUSH, DUP, MSTORE, and KECCAK256 instructions bear no obvious relationship to the original mapping access.
“Look up the balance for this address” becomes a hash computation followed by a storage read. An analysis tool working at the opcode level sees only these operations. After semantic lifting, the tool understands that they represent a mapping lookup—a distinction that matters for both human comprehension and automated vulnerability detection.
flowchart LR
classDef data fill:#332a1a,stroke:#d4b870,stroke-width:2px,color:#f0e0c0
classDef process fill:#1a2233,stroke:#7ea8d4,stroke-width:2px,color:#c0d8f0
classDef highlight fill:#1a331a,stroke:#a8c686,stroke-width:2px,color:#c8e8b0
A["Raw Opcodes\nPUSH DUP KECCAK256 SLOAD"]:::data
B["Structural Analysis\nSSA, Type Propagation,\nPattern Recognition"]:::process
C["Semantic Model\nmapping(address => uint256)\nbalances[msg.sender]"]:::highlight
A -->|semantic lifting| B --> C
This gap between opcode sequences and semantic meaning is where vulnerability detection becomes difficult. A reentrancy vulnerability is not “CALL followed by SSTORE”—it is “an external call that executes before state is finalized.” Without semantic lifting, that distinction is invisible.
Type Inference
The EVM operates exclusively on 256-bit words. Every value, regardless of its original type, becomes an undifferentiated 256-bit integer at the bytecode level. Types are recovered by analyzing how values are used throughout the program.
Address Detection
When a value serves as the target of a CALL or DELEGATECALL instruction, it almost certainly represents an address. Comparisons with msg.sender or tx.origin strongly suggest address type. Values passed to transfer operations, or values masked with the 20-byte address mask 0xffffffffffffffffffffffffffffffffffffffff, are similarly identified.
Boolean Recovery
Values that result from comparison operations, serve as branch conditions in conditional jumps, or participate in logical operations are likely booleans. Values that are explicitly constrained to 0 or 1 through masking or conditional logic are classified as boolean types.
Cryptographic and Numeric Types
The bytes32 type appears in cryptographic contexts: inputs to hash operations, storage keys, event topics, and signature parameters. Integer signedness is inferred from arithmetic and comparison opcodes: SDIV and SMOD indicate signed division and signed modulo, while SLT and SGT indicate signed comparisons. Range checks provide additional signedness hints.
Type Propagation
Once the type of one value is established, that information propagates through the program along data flow paths. If a value is known to be an address, and that address is used as a mapping key, the mapping’s key type becomes known. If the mapping returns a value used in arithmetic, that value’s type can be inferred. This propagation recovers types even for values that don’t directly exhibit type-identifying patterns.
| Usage Pattern | Inferred Type |
|---|---|
| CALL/DELEGATECALL target, comparison with msg.sender | address |
| Result of comparison opcodes, branch condition | bool |
| Input to KECCAK256, event topic, signature parameter | bytes32 |
| SDIV/SMOD/SLT/SGT arithmetic | int256 |
| Arithmetic with range checks, token amount patterns | uint256 |
| Byte extraction with masking | bytesN variants |
| Sequential storage with length prefix | Array element type |
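The usage rules and propagation step above can be sketched as a small fixed-point loop. This is a minimal illustration, not Sigvex's implementation: the usage labels, the `infer_types` function, and the idea of propagating only along plain copy edges are all assumptions made for the example.

```python
# Hypothetical usage labels that a lifter might attach to SSA values.
USAGE_TO_TYPE = {
    "call_target": "address",           # target of CALL / DELEGATECALL
    "compared_with_caller": "address",  # compared against msg.sender
    "comparison_result": "bool",        # produced by LT, GT, EQ, ISZERO
    "branch_condition": "bool",         # consumed by JUMPI
    "keccak_input": "bytes32",          # hashed or used as an event topic
    "signed_arithmetic": "int256",      # operand of SDIV, SMOD, SLT, SGT
}


def infer_types(usages, copies):
    """Infer types from usage labels, then propagate them along copies.

    usages: {value_name: [usage labels]}
    copies: list of (src, dst) pairs where dst is an unmodified copy of src
    """
    types = {}
    # Step 1: direct evidence from how each value is used.
    for value, labels in usages.items():
        for label in labels:
            if label in USAGE_TO_TYPE:
                types[value] = USAGE_TO_TYPE[label]
                break
    # Step 2: propagate in both directions until nothing changes.
    changed = True
    while changed:
        changed = False
        for src, dst in copies:
            for known, unknown in ((src, dst), (dst, src)):
                if known in types and unknown not in types:
                    types[unknown] = types[known]
                    changed = True
    return types


example = infer_types(
    usages={"v1": ["call_target"], "v3": ["comparison_result"]},
    copies=[("v0", "v1"), ("v0", "v2")],
)
```

Here `v1` is typed directly from its use as a call target, and `v0` and `v2` receive the address type only through propagation, which mirrors the case of values that never exhibit a type-identifying pattern themselves.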
Storage Layout Reconstruction
Solidity organizes contract storage according to deterministic layout rules that depend on variable declarations and ordering. These rules can be reversed to reconstruct the original variable organization from storage access patterns observed in bytecode.
Simple State Variables
Simple state variables occupy sequential storage slots starting from slot zero. When the bytecode loads from or stores to a fixed slot number, this is identified as a simple variable access, and the variable’s position in the original declaration order is inferred.
Mapping Recognition
Mappings use a more complex storage pattern. The storage slot for a mapping value is computed by hashing the concatenation of the key and the mapping’s base slot. When a KECCAK256 operation’s result feeds into a SLOAD or SSTORE, the hash input is analyzed to determine whether this represents a mapping access.
If the input consists of a 32-byte key followed by a constant slot number, the pattern indicates a single-level mapping. Nested mappings produce nested hash computations: keccak256(key2 . keccak256(key1 . slot)) represents mapping[key1][key2].
flowchart TD
classDef process fill:#1a2233,stroke:#7ea8d4,stroke-width:2px,color:#c0d8f0
classDef data fill:#332a1a,stroke:#d4b870,stroke-width:2px,color:#f0e0c0
classDef highlight fill:#332519,stroke:#e8a87c,stroke-width:2px,color:#f0d8c0
A["CALLER → caller_address"]:::data
B["PUSH 0x01 → slot_number"]:::data
C["MSTORE, MSTORE → concat"]:::process
D["KECCAK256 → derived_slot"]:::process
E["SLOAD(derived_slot) → value"]:::process
F["Semantic: balances[msg.sender]"]:::highlight
A --> C
B --> C
C --> D
D --> E
E --> F
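The matching step can be sketched over a symbolic expression tree. The node classes `Const`, `Var`, and `Keccak` and the function `match_mapping` are hypothetical names for this example; the sketch assumes the lifter has already turned the MSTORE/KECCAK256 sequence into a `Keccak(key, slot)` node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Keccak:
    # Represents keccak256(key . slot) over a 64-byte buffer.
    key: object
    slot: object


def match_mapping(expr):
    """Return (base_slot, keys in source order) for a mapping slot, else None."""
    keys = []
    # Peel hash layers from the outside in; the outermost hash holds the last key.
    while isinstance(expr, Keccak):
        keys.append(expr.key)
        expr = expr.slot
    if keys and isinstance(expr, Const):
        return expr.value, list(reversed(keys))
    return None


single = match_mapping(Keccak(Var("msg.sender"), Const(1)))
nested = match_mapping(Keccak(Var("key2"), Keccak(Var("key1"), Const(2))))
```

Under these assumptions `single` describes a one-level mapping at slot 1, and `nested` describes mapping[key1][key2] at slot 2, with the keys returned in source order rather than hash order.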
Dynamic Arrays, Structs, and Packing
Dynamic arrays store their length at the base slot and their elements in consecutive slots starting at the hash of the base slot: for elements that each fill a whole slot, element i lives at keccak256(slot) + i. Struct fields occupy consecutive slots starting from the struct’s base slot, with each field at a fixed offset.
Solidity also packs multiple values smaller than 32 bytes into a single storage slot. This is detected by observing masking and shifting operations applied to SLOAD results. When a load result is masked with 0xff without any shift, the access is to a uint8 occupying the low byte of a packed slot; when the result is first shifted right by 8 bits and then masked with 0xff, the uint8 occupies the next byte up. Reconstructing packing is essential for precise storage layout analysis in compact structs.
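A shift-and-mask pair maps to a field position as sketched below. The function names are illustrative, and the sketch assumes the read has already been normalized to the form (SLOAD(slot) >> shift) & mask.

```python
def describe_packed_read(shift_bits, mask):
    """Describe the field read by (SLOAD(slot) >> shift_bits) & mask.

    Returns None when the pair does not look like a byte-aligned field.
    """
    if shift_bits % 8 != 0:
        return None
    width_bits = mask.bit_length()
    # The mask must be a contiguous run of low bits covering whole bytes.
    if width_bits == 0 or width_bits % 8 != 0 or mask != (1 << width_bits) - 1:
        return None
    return {"byte_offset": shift_bits // 8, "width_bytes": width_bits // 8}


def read_packed(slot_value, shift_bits, mask):
    """Extract the field value from a raw 256-bit slot word."""
    return (slot_value >> shift_bits) & mask


low_byte = describe_packed_read(0, 0xFF)
second_byte = describe_packed_read(8, 0xFF)
address_field = describe_packed_read(0, (1 << 160) - 1)
```

With no shift, a 0xff mask selects byte offset 0; with an 8-bit shift it selects byte offset 1. A 160-bit mask yields a 20-byte field, which is consistent with an address sharing its slot with smaller values.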
Proxy Pattern Resolution
Modern smart contracts frequently use proxy patterns for upgradeability. Understanding what a proxy actually does requires resolving through the proxy to analyze the underlying implementation. Without this step, a protocol where all logic passes through a diamond proxy appears as calls to a single forwarding contract, missing the actual function implementations.
flowchart TD
classDef system fill:#1a3333,stroke:#5ba8a8,stroke-width:2px,color:#c0e8e8
classDef process fill:#1a2233,stroke:#7ea8d4,stroke-width:2px,color:#c0d8f0
classDef data fill:#332a1a,stroke:#d4b870,stroke-width:2px,color:#f0e0c0
classDef highlight fill:#332519,stroke:#e8a87c,stroke-width:2px,color:#f0d8c0
A["Proxy Bytecode"]:::data
B["Pattern Detection"]:::process
C{"EIP Type?"}:::process
D["EIP-1967\nStandard Slots"]:::system
E["EIP-1167\nMinimal Clone"]:::system
F["EIP-2535\nDiamond"]:::system
G["Beacon\nProxy"]:::system
H["Read Implementation\nAddress"]:::process
I["Recursive Resolution\n(nested proxies)"]:::process
J["Unified Analysis\nProxy + Implementation"]:::highlight
A --> B --> C
C --> D & E & F & G
D & E & F & G --> H --> I --> J
EIP-1967 defines standard storage slots for transparent and UUPS proxies. The implementation address resides at a fixed slot computed as the Keccak-256 hash of the string “eip1967.proxy.implementation”, minus one. When these characteristic slots appear in bytecode, the contract is identified as an EIP-1967 proxy and the implementation address is read for further analysis.
EIP-1167 minimal proxy clones use a distinctive bytecode pattern that embeds the implementation address directly in the contract code. These contracts delegatecall to a hardcoded address for all function calls. The characteristic 45-byte bytecode sequence is detected and the embedded implementation address extracted.
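Detection of the 45-byte clone can be sketched as a prefix and suffix match around the embedded address. The byte strings below are the runtime code published in EIP-1167; the function name is illustrative, and the sketch deliberately ignores the shortened variants that the EIP allows for addresses with leading zero bytes.

```python
# Runtime code from EIP-1167: 10-byte prefix, 20-byte address, 15-byte suffix.
EIP1167_PREFIX = bytes.fromhex("363d3d373d3d3d363d73")
EIP1167_SUFFIX = bytes.fromhex("5af43d82803e903d91602b57fd5bf3")


def extract_eip1167_implementation(runtime_code: bytes):
    """Return the embedded implementation address as hex, or None."""
    if (
        len(runtime_code) == 45
        and runtime_code.startswith(EIP1167_PREFIX)
        and runtime_code.endswith(EIP1167_SUFFIX)
    ):
        # The address sits between the PUSH20 opcode (0x73) and the suffix.
        return "0x" + runtime_code[10:30].hex()
    return None


clone_code = EIP1167_PREFIX + bytes.fromhex("be" * 20) + EIP1167_SUFFIX
implementation = extract_eip1167_implementation(clone_code)
```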
EIP-2535 diamond proxies support multiple implementation contracts (facets), with function selectors routing to different facet addresses. The function selector dispatch logic is analyzed to identify all facets and their associated functions, providing a complete view of the diamond’s capabilities.
Beacon proxies add a layer of indirection, reading their implementation address from a separate beacon contract. The resolution follows this chain: identify the beacon address, read the implementation from the beacon.
Proxies can be chained, with one proxy pointing to another, so resolution is applied recursively until a non-proxy implementation is reached. After resolution, analysis proceeds on the resolved implementation contract rather than the forwarding shell.
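Following a chain of proxies can be sketched as a loop with cycle and depth protection. This example only follows the EIP-1967 implementation slot and reads from an in-memory storage snapshot; the `storage` dictionary shape, the `max_depth` limit, and the function name are assumptions made for illustration.

```python
# bytes32(uint256(keccak256("eip1967.proxy.implementation")) - 1), per EIP-1967.
EIP1967_IMPL_SLOT = 0x360894A13BA1A3210667C828492DB98DCA3E2076CC3735A920A3CA505D382BBC
ADDRESS_MASK = (1 << 160) - 1


def resolve_implementation(address, storage, max_depth=8):
    """Follow EIP-1967 implementation slots until a non-proxy is reached.

    storage: {address: {slot: word}} snapshot (hypothetical input format).
    Returns (final_address, list of addresses visited).
    """
    visited = []
    current = address
    while len(visited) < max_depth:
        if current in visited:
            raise ValueError("proxy cycle detected")
        visited.append(current)
        word = storage.get(current, {}).get(EIP1967_IMPL_SLOT, 0)
        if word == 0:
            # No implementation slot set: treat this contract as the target.
            return current, visited
        current = word & ADDRESS_MASK
    raise ValueError("proxy chain exceeds max_depth")


snapshot = {
    0xA: {EIP1967_IMPL_SLOT: 0xB},
    0xB: {EIP1967_IMPL_SLOT: 0xC},
}
resolved, path = resolve_implementation(0xA, snapshot)
```

The depth limit and cycle check are design choices for the sketch: a misconfigured or malicious proxy can point back at itself, and an analysis tool should fail cleanly in that case rather than loop.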
Function and Event Signature Recovery
Solidity function names and parameter information are not preserved in compiled bytecode. Only the four-byte function selector, the first four bytes of the Keccak-256 hash of the canonical function signature, remains.
Database lookup is the primary recovery mechanism. A database of over one million known function signatures collected from verified contracts, interface standards, and common libraries handles common functions instantly. The signature transfer(address,uint256) maps to selector 0xa9059cbb; approve(address,uint256) maps to 0x095ea7b3. Lookup provides full parameter type information for recognized selectors.
Structural analysis handles unknown selectors. The number of parameters is inferred from CALLDATALOAD operations and stack usage patterns. Parameter types are recovered using type inference. Return types are determined from how the function prepares its output data.
For events, the first topic of a log contains the hash of the event signature, enabling the same database lookup approach. The Transfer(address,address,uint256) event signature hashes to a well-known value. Indexed parameters correspond to additional topics; non-indexed parameters are encoded in the log data section. For unknown events, the data layout is parsed according to ABI encoding rules to determine parameter counts and types.
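The lookup path for function selectors can be sketched as a linear scan for PUSH4 operands followed by a dictionary lookup. The two-entry table stands in for the much larger signature database, and treating every PUSH4 operand as a selector candidate is a simplifying assumption; a real dispatcher analysis would also check that the constant is compared against the calldata selector.

```python
# Tiny stand-in for a signature database; both entries appear in the text above.
KNOWN_SELECTORS = {
    0xA9059CBB: "transfer(address,uint256)",
    0x095EA7B3: "approve(address,uint256)",
}

PUSH1, PUSH4, PUSH32 = 0x60, 0x63, 0x7F


def find_push4_constants(code: bytes):
    """Collect PUSH4 operands, skipping the immediates of every PUSH opcode."""
    found = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if PUSH1 <= op <= PUSH32:
            size = op - PUSH1 + 1
            if op == PUSH4 and pc + 1 + size <= len(code):
                found.append(int.from_bytes(code[pc + 1 : pc + 1 + size], "big"))
            pc += 1 + size  # never decode immediate data as opcodes
        else:
            pc += 1
    return found


def label_selectors(code: bytes):
    """Map each candidate selector to a known signature or 'unknown'."""
    return {
        f"0x{sel:08x}": KNOWN_SELECTORS.get(sel, "unknown")
        for sel in find_push4_constants(code)
    }


# PUSH4 <selector> EQ, repeated three times (0x14 is EQ).
dispatcher = bytes.fromhex("63a9059cbb14" "63095ea7b314" "63deadbeef14")
labels = label_selectors(dispatcher)
```

Selectors that come back as unknown are the ones that would fall through to the structural analysis described above.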
Constant Propagation
Compiled bytecode often contains literal numeric values whose significance is not immediately apparent. Constant propagation annotates these values wherever they can be identified.
Timestamp comparisons that reference specific dates become comprehensible when converted. A comparison like require(block.timestamp < 1704067200) becomes require(block.timestamp < DEADLINE) with an annotation indicating January 1, 2024 (00:00:00 UTC). Token amounts involving common decimal scales (1e18 for standard tokens, 1e6 for USDC) are recognized and labeled. Time periods like 86400 (one day) or 604800 (one week) receive descriptive names.
| Pattern | Recognized As |
|---|---|
| 0xffffffffffffffffffffffffffffffffffffffff | Address mask (20-byte type) |
| 1000000000000000000 | 1 ETH / 1e18 token unit |
| 1000000 | 1 USDC (6-decimal token unit) |
| 86400 | 1 day in seconds |
| 604800 | 1 week in seconds |
| 0xddf252ad... | ERC-20 Transfer event signature |
| 0x8c5be1e5... | ERC-20 Approval event signature |
This makes hardcoded protocol parameters immediately visible to security reviewers—including deadline values, threshold amounts, and time windows that may be relevant to attack feasibility.
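Annotation of this kind can be sketched as a table lookup with a fallback timestamp heuristic. The label strings, the function name, and the plausibility window for timestamps (2015 through 2099) are assumptions chosen for the example.

```python
from datetime import datetime, timezone

NAMED_CONSTANTS = {
    (1 << 160) - 1: "address mask (20 bytes)",
    10**18: "1e18 (1 ETH or 18-decimal token unit)",
    10**6: "1e6 (6-decimal token unit, e.g. USDC)",
    86400: "1 day in seconds",
    604800: "1 week in seconds",
}

# Assumed window for values that plausibly represent Unix timestamps.
TIMESTAMP_MIN = 1420070400  # 2015-01-01 00:00:00 UTC
TIMESTAMP_MAX = 4102444800  # 2100-01-01 00:00:00 UTC


def annotate_constant(value: int):
    """Return a human-readable label for a literal, or None if unrecognized."""
    if value in NAMED_CONSTANTS:
        return NAMED_CONSTANTS[value]
    if TIMESTAMP_MIN <= value < TIMESTAMP_MAX:
        moment = datetime.fromtimestamp(value, tz=timezone.utc)
        return "timestamp " + moment.strftime("%Y-%m-%d %H:%M:%S UTC")
    return None


deadline_label = annotate_constant(1704067200)
```

Exact-match names are checked before the timestamp range so that a well-known constant is never mislabeled as a date. The range check is only a heuristic: an unrelated literal that happens to fall inside the window would also be labeled as a timestamp.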
High-Level Pattern Recognition
Beyond individual value types, semantic lifting recognizes compound patterns corresponding to well-known Solidity idioms.
Ownership patterns: The Ownable pattern appears consistently—a storage slot holding an address, compared against msg.sender before sensitive operations, with a transfer function that validates the new owner. Recognizing this pattern labels the storage slot owner, marks access-controlled functions, and flags any functions that should have this guard but appear to be missing it.
Role-based access control: These patterns use a nested mapping keyed by role identifier and account address, with administrative functions to grant and revoke roles. The analysis recognizes this structure and models the resulting permission system.
Standard token interfaces: ERC-20 contracts exhibit a recognizable combination of storage patterns (balances mapping, allowances nested mapping, total supply slot), function signatures, and event emissions. When this combination is detected, token-specific analysis rules apply. Similar recognition handles ERC-721, ERC-1155, and ERC-4626 vault contracts.
Reentrancy guards: The mutex pattern—a storage slot set to a “locked” value at the start of external call sequences and reset at completion—is identified and used to suppress false positive reentrancy findings in contracts that have implemented proper protections.
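Guard-aware reentrancy checking can be sketched over a list of lifted operations for a single execution path. The operation kinds (`guard_lock`, `guard_unlock`, `external_call`, `state_write`) and the function name are hypothetical labels, and the sketch assumes earlier lifting stages have already classified guard-slot writes separately from ordinary state writes.

```python
def find_reentrancy_candidates(ops):
    """Flag state writes that follow an external call made without a guard.

    ops: ordered list of (kind, detail) tuples for one execution path.
    Returns a list of (index, detail) for each suspicious state write.
    """
    findings = []
    locked = False
    unguarded_call_seen = False
    for index, (kind, detail) in enumerate(ops):
        if kind == "guard_lock":
            locked = True
        elif kind == "guard_unlock":
            locked = False
        elif kind == "external_call":
            # A call made while the mutex is held cannot re-enter guarded code.
            if not locked:
                unguarded_call_seen = True
        elif kind == "state_write" and unguarded_call_seen:
            findings.append((index, detail))
    return findings


vulnerable = [("external_call", "msg.sender"), ("state_write", "balances")]
guarded = [
    ("guard_lock", None),
    ("external_call", "msg.sender"),
    ("state_write", "balances"),
    ("guard_unlock", None),
]
checks_effects_interactions = [
    ("state_write", "balances"),
    ("external_call", "msg.sender"),
]
```

Only the `vulnerable` path produces a finding. The guarded path and the path that writes state before calling out are both left alone, which is the false-positive suppression described above.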
Connection to Vulnerability Detection Quality
The quality of semantic lifting directly determines the quality of everything that runs on top of it.
At the opcode level, a potential integer overflow is just an ADD instruction. With semantic lifting, the operand types and their intended ranges are known, which makes it possible to determine whether an overflow can occur and whether existing checks are sufficient.
Access control analysis at the opcode level is a comparison instruction and a conditional jump. With semantic lifting, it is an authorization check—and the semantic model reveals whether the right account is being checked for the right operation on the right resource.
Oracle manipulation detection at the opcode level is a sequence of SLOAD and arithmetic operations. With semantic lifting, it is a price oracle query followed by a financial calculation, and the model reveals whether appropriate staleness checks and manipulation-resistance measures are in place.
Every precision improvement in type inference, storage reconstruction, or pattern recognition propagates directly into detector accuracy. The reentrancy detector benefits from knowing which storage slots matter. The proxy storage collision detector only works if proxy and implementation storage layouts are both correctly mapped. Signature recovery determines whether a function the detector considers sensitive is actually the ERC-20 transfer function or something else entirely.
Semantic lifting is not a preprocessing step that can be approximated—it is the foundation that determines what can and cannot be detected accurately.