Taint Propagation in Stack-Based Virtual Machines
Information Flow Analysis for EVM and eBPF Bytecode
Lattice-based taint analysis framework for stack-based virtual machines, with formal transfer functions and precision analysis for smart contract security.
Taint Propagation in Stack-Based Virtual Machines
Abstract
Taint analysis tracks the flow of untrusted data through a program, marking values derived from attacker-controlled sources and checking whether they reach security-critical sinks. When applied to stack-based virtual machines such as the Ethereum Virtual Machine (EVM) and Solana’s extended Berkeley Packet Filter (eBPF), taint propagation encounters structural challenges absent from register-based or source-level analysis: implicit data flow through stack position, destructive stack operations that obscure provenance, and cryptographic operations that sever taint chains while preserving semantic dependency. This paper defines a lattice-based taint domain over abstract stack machine states, presents forward and backward propagation algorithms with fixpoint characterization, and derives formal transfer functions for representative EVM and eBPF opcodes. We analyze the precision-soundness tradeoff inherent in each design choice—where over-tainting generates false vulnerability reports and under-tainting permits missed detections—and characterize the specific points at which stack machine semantics force this tradeoff. We give particular attention to cryptographic hash operations used in storage key derivation, control-dependence-induced implicit flows, and aliasing through dynamic dispatch. The analysis concludes with a framework for measuring taint precision against ground-truth vulnerability corpora.
1. Introduction
Source-level taint analysis benefits from named variables, explicit type information, and structured control flow. When source code is unavailable—as is the case for the vast majority of deployed smart contracts and on-chain Solana programs—analysis must operate directly on bytecode. This introduces fundamental difficulties.
Stack-based virtual machines represent intermediate computation state as a last-in-first-out stack rather than named registers. A value’s identity is determined by its position, which shifts with every push, pop, or swap. There are no variable names to attach taint labels to; there is only the evolving stack configuration. Memory and storage provide additional state dimensions with their own addressing semantics.
The practical motivation is security analysis of deployed contracts. Vulnerability classes such as reentrancy, access control bypass, oracle manipulation, and price feed corruption all involve untrusted data reaching operations where it should not. Taint analysis provides a principled framework for detecting these flows. But the gap between the theoretical framework and a useful implementation is wide, and the specific properties of stack machines determine where precision is lost.
This paper addresses three questions. First, what is the appropriate abstract domain for taint analysis over stack machine states? Second, what are the formal transfer functions for the core instruction sets, and where do they force precision loss? Third, how should forward and backward propagation be combined to balance cost against detection quality?
We focus on the EVM (Ethereum Yellow Paper, Wood 2014) and eBPF as instantiated by the Solana runtime (Solana Labs, “Solana Program Runtime”), though the framework generalizes to other stack and register machines.
2. Taint Lattice Definition
2.1 Taint Labels
Let $\mathcal{L}$ be a finite set of taint labels representing distinct untrusted data sources. In the EVM context, typical labels include:
- $\ell_{\text{calldata}}$: transaction input data (attacker-controlled)
- $\ell_{\text{caller}}$: the
msg.senderaddress - $\ell_{\text{callvalue}}$: the
msg.valuefield - $\ell_{\text{extcode}}$: data returned by external calls
- $\ell_{\text{storage}}$: values loaded from contract storage (conditionally tainted)
- $\ell_{\text{returndatacopy}}$: return data from sub-calls
For eBPF/Solana programs, the analogous labels are $\ell_{\text{instruction_data}}$ (transaction instruction data), $\ell_{\text{account}}$ (account data from input accounts), and $\ell_{\text{cpi_return}}$ (cross-program invocation results).
2.2 The Taint Powerset Lattice
The taint domain for a single value is the powerset lattice $(\mathcal{P}(\mathcal{L}), \subseteq, \cup, \cap, \emptyset, \mathcal{L})$:
- Bottom $\bot = \emptyset$: untainted (clean)
- Top $\top = \mathcal{L}$: tainted by all sources
- Join $\sqcup = \cup$: merging taint from multiple paths
- Meet $\sqcap = \cap$: intersection (used in must-taint analysis)
This forms a complete lattice of height $|\mathcal{L}| + 1$. The join operation is used at control flow merge points: if a value is tainted along one path and untainted along another, the join produces a tainted result. This is the standard may-analysis choice that preserves soundness at the cost of precision.
2.3 Abstract Machine State
An abstract state $\sigma$ consists of four components:
$$\sigma = (\hat{S}, \hat{M}, \hat{T}, \hat{R})$$
where:
- $\hat{S} : \text{Stack} \to \mathcal{P}(\mathcal{L})$ assigns a taint set to each stack position
- $\hat{M} : \mathbb{N}_{256} \to \mathcal{P}(\mathcal{L})$ assigns taint to each memory byte (or word-aligned region)
- $\hat{T} : \mathbb{N}_{256} \to \mathcal{P}(\mathcal{L})$ assigns taint to each storage slot
- $\hat{R} : \text{Reg} \to \mathcal{P}(\mathcal{L})$ assigns taint to each register (for eBPF; empty for pure stack machines)
The abstract state lattice is the pointwise product of these component lattices. The join of two states is component-wise union at each position.
flowchart TB
classDef system fill:#1a3333,stroke:#5ba8a8,stroke-width:2px,color:#c0e8e8
classDef process fill:#1a2233,stroke:#7ea8d4,stroke-width:2px,color:#c0d8f0
classDef data fill:#332a1a,stroke:#d4b870,stroke-width:2px,color:#f0e0c0
classDef highlight fill:#332519,stroke:#e8a87c,stroke-width:2px,color:#f0d8c0
TOP["⊤ = {ℓ_calldata, ℓ_caller, ℓ_callvalue, ℓ_extcode, ...}"]:::highlight
CD["{ℓ_calldata, ℓ_caller}"]:::data
CV["{ℓ_calldata, ℓ_callvalue}"]:::data
CE["{ℓ_caller, ℓ_extcode}"]:::data
C["{ℓ_calldata}"]:::process
R["{ℓ_caller}"]:::process
V["{ℓ_callvalue}"]:::process
E["{ℓ_extcode}"]:::process
BOT["⊥ = ∅ (untainted)"]:::system
TOP --> CD
TOP --> CV
TOP --> CE
CD --> C
CD --> R
CV --> C
CV --> V
CE --> R
CE --> E
C --> BOT
R --> BOT
V --> BOT
E --> BOT
2.4 Taint Granularity
A key design choice is the granularity of memory taint tracking. Word-level tracking (one taint set per 32-byte EVM word) is efficient but imprecise when individual bytes carry different taint. Byte-level tracking is precise but expensive—EVM memory is addressed in bytes, and a single MSTORE writes 32 bytes. In practice, word-level tracking with byte-level refinement at CALLDATACOPY and RETURNDATACOPY boundaries provides a reasonable tradeoff.
For storage, slot-level granularity is natural for the EVM (each SLOAD/SSTORE operates on a single 256-bit slot). However, Solidity’s tight variable packing places multiple values in a single slot, requiring sub-slot taint tracking for precision.
3. Transfer Functions
A transfer function $\tau_{\text{op}} : \Sigma \to \Sigma$ maps an abstract state before an instruction to the abstract state after it. We define $\tau$ for representative instruction categories.
3.1 Notation
Let $\hat{S}[i]$ denote the taint of the $i$-th element from the top of the stack (0-indexed). Let $\text{push}(t, \hat{S})$ push taint set $t$ onto the stack, and $\text{pop}(\hat{S})$ remove the top element.
3.2 Arithmetic and Logic Operations
For a binary arithmetic operation $\text{op}(a, b) \to c$ (e.g., ADD, SUB, MUL, DIV, MOD, EXP):
$$\tau_{\text{binop}}(\sigma) = \sigma[\hat{S} \mapsto \text{push}(\hat{S}[0] \cup \hat{S}[1],\ \text{pop}(\text{pop}(\hat{S})))]$$
The result inherits the union of both operands’ taint. This is sound: if either operand is attacker-controlled, the result depends on attacker input. It is also a source of over-tainting. The expression x * 0 always yields zero regardless of x, but the transfer function taints the result with x’s labels. Constant folding in a preceding analysis pass mitigates this for literal zeros, but the general problem—determining when an operation produces an output independent of a tainted input—is undecidable.
Unary operations (ISZERO, NOT) propagate the operand’s taint directly:
$$\tau_{\text{unop}}(\sigma) = \sigma[\hat{S} \mapsto \text{push}(\hat{S}[0],\ \text{pop}(\hat{S}))]$$
Comparison operations (LT, GT, SLT, SGT, EQ) follow the binary rule. This is significant: a comparison between a tainted and an untainted value produces a tainted boolean. If that boolean controls a branch, this creates an implicit information flow (Section 7.1).
3.3 Stack Manipulation
Stack operations are where stack machines diverge most sharply from register machines for taint analysis.
DUP$_n$ copies the $n$-th stack element to the top:
$$\tau_{\text{DUP}_n}(\sigma) = \sigma[\hat{S} \mapsto \text{push}(\hat{S}[n-1],\ \hat{S})]$$
The duplicated value carries the same taint as the original. Critically, the original remains in place—there are now two stack positions with the same taint set but no syntactic link between them.
SWAP$_n$ exchanges the top element with the $(n+1)$-th:
$$\tau_{\text{SWAP}_n}(\sigma) = \sigma[\hat{S} \mapsto \hat{S}[0 \mapsto \hat{S}[n],\ n \mapsto \hat{S}[0]]]$$
This is a pure permutation and preserves all taint. However, after a sequence of SWAPs, determining which original value occupies which stack position requires tracking through the full permutation history. Abstract interpretation handles this correctly by propagating taint sets through positions, but it illustrates why source-level variable names would simplify matters considerably.
POP discards the top element. If the popped value was tainted, that taint information is lost. For forward analysis this is correct—the value is no longer reachable. For backward analysis (Section 6), a POP of a taint-demanded value means the demand must propagate upward to wherever the value was produced.
PUSH$_n$ pushes a constant. Constants are always untainted:
$$\tau_{\text{PUSH}_n}(\sigma) = \sigma[\hat{S} \mapsto \text{push}(\emptyset,\ \hat{S})]$$
3.4 Memory Operations
MSTORE(offset, value) writes a 32-byte value to memory:
$$\tau_{\text{MSTORE}}(\sigma) = \sigma[\hat{M} \mapsto \hat{M}[\hat{S}[0]_{\text{concrete}} \mapsto \hat{S}[1]],\ \hat{S} \mapsto \text{pop}(\text{pop}(\hat{S}))]$$
When the offset is a concrete constant (resolvable through constant propagation), the memory write targets a known location and taint is precisely assigned. When the offset is itself tainted or symbolic, soundness requires tainting all potentially addressed memory—a catastrophic loss of precision called the tainted-index problem. Practical implementations bound the damage by restricting memory to observed concrete ranges and issuing a warning when symbolic writes are encountered.
MLOAD(offset) reads 32 bytes from memory:
$$\tau_{\text{MLOAD}}(\sigma) = \sigma[\hat{S} \mapsto \text{push}(\hat{M}[\hat{S}[0]_{\text{concrete}}],\ \text{pop}(\hat{S}))]$$
If the offset is tainted, the loaded value must conservatively receive $\top$ taint.
3.5 Storage Operations
SLOAD(slot) and SSTORE(slot, value) follow the same pattern as memory but with an additional complication: storage slots are often computed via KECCAK256 (Section 7.2).
$$\tau_{\text{SLOAD}}(\sigma) = \sigma[\hat{S} \mapsto \text{push}(\hat{T}[\text{resolve}(\hat{S}[0])],\ \text{pop}(\hat{S}))]$$
$$\tau_{\text{SSTORE}}(\sigma) = \sigma[\hat{T} \mapsto \hat{T}[\text{resolve}(\hat{S}[0]) \mapsto \hat{S}[1]],\ \hat{S} \mapsto \text{pop}(\text{pop}(\hat{S}))]$$
The function $\text{resolve}$ attempts to determine the concrete slot. When the slot key is derived from tainted input hashed through KECCAK256, exact resolution requires symbolic reasoning about the hash preimage (see Section 7.2).
3.6 External Calls
CALL and its variants (STATICCALL, DELEGATECALL, CALLCODE) are the most complex transfer functions. A CALL consumes seven stack arguments and produces a boolean success indicator. The transfer function must:
- Taint the return value (success boolean) with the union of all argument taints plus $\ell_{\text{extcode}}$, since the callee’s behavior is not known.
- Taint the memory region designated for return data with $\ell_{\text{extcode}}$.
- If the call target address is tainted, the entire call is to an attacker-controlled contract, escalating the severity of any subsequent data use.
$$\tau_{\text{CALL}}(\sigma) = \sigma[\hat{S} \mapsto \text{push}({\ell_{\text{extcode}}} \cup \hat{S}[1],\ \text{pop}^7(\hat{S})),$$ $$\hat{M} \mapsto \hat{M}[\text{retOffset} .. \text{retOffset} + \text{retSize} \mapsto {\ell_{\text{extcode}}}]]$$
For eBPF, the analogous operation is cross-program invocation (CPI), where the callee is a Solana program identified by its program ID. The taint treatment is similar: return values and modified account data receive $\ell_{\text{cpi_return}}$.
3.7 Transfer Function Summary
| Instruction Class | Stack Effect | Taint Rule |
|---|---|---|
Binary arithmetic (ADD, MUL, …) | pop 2, push 1 | Union of operand taints |
Unary (ISZERO, NOT) | pop 1, push 1 | Propagate operand taint |
Comparison (LT, EQ, …) | pop 2, push 1 | Union of operand taints |
DUP$_n$ | push 1 (copy of depth $n$) | Copy taint from depth $n$ |
SWAP$_n$ | permute positions 0 and $n$ | Permute taint |
PUSH$_n$ | push 1 | $\emptyset$ (constant) |
POP | pop 1 | Discard taint |
MSTORE | pop 2 | Write operand taint to memory |
MLOAD | pop 1, push 1 | Read taint from memory |
SLOAD | pop 1, push 1 | Read taint from storage |
SSTORE | pop 2 | Write operand taint to storage |
CALL | pop 7, push 1 | $\ell_{\text{extcode}}$ + address taint |
KECCAK256 | pop 2, push 1 | See Section 7.2 |
CALLDATALOAD | pop 1, push 1 | ${\ell_{\text{calldata}}}$ |
CALLER | push 1 | ${\ell_{\text{caller}}}$ |
CALLVALUE | push 1 | ${\ell_{\text{callvalue}}}$ |
4. Forward Propagation
4.1 Algorithm
Forward taint propagation computes a fixpoint over the control flow graph (CFG) by iterating abstract transfer functions from entry to exit. The algorithm is a standard worklist-based chaotic iteration:
FORWARD-TAINT(CFG, sources):
for each block b in CFG:
state[b] = ⊥ // initialize all blocks to untainted
state[entry] = init_taint(sources) // seed source taints at entry
worklist = {entry}
while worklist ≠ ∅:
b = worklist.remove()
out = transfer(b, state[b]) // apply block's transfer functions sequentially
for each successor s of b:
merged = state[s] ⊔ out // join at merge points
if merged ≠ state[s]:
state[s] = merged
worklist.add(s)
return state
4.2 Fixpoint Existence and Termination
The abstract domain is a complete lattice of finite height (bounded by $|\mathcal{L}| \times |\text{positions}|$). All transfer functions are monotone: if $\sigma_1 \sqsubseteq \sigma_2$ then $\tau(\sigma_1) \sqsubseteq \tau(\sigma_2)$. By the Kleene fixpoint theorem, the iteration converges to the least fixpoint.
4.3 Complexity
Let $n$ be the number of basic blocks, $k = |\mathcal{L}|$ the number of taint labels, and $p$ the maximum number of tracked positions (stack depth + memory locations + storage slots). Each block can be visited at most $O(k \cdot p)$ times, since each visit must strictly increase the taint state. Each visit costs $O(|b| \cdot p)$ where $|b|$ is the number of instructions in the block. Total worst-case complexity is $O(n \cdot k \cdot p^2 \cdot |b|_{\text{max}})$.
In practice, convergence is much faster. Most taint propagation stabilizes within two or three iterations because taint flows forward monotonically and rarely requires many join operations to reach a fixpoint. The limiting factor is typically the number of loop back-edges in the CFG rather than the lattice height.
4.4 Intraprocedural vs. Interprocedural
For EVM bytecode, which typically consists of a single contract without internal function calls in the traditional sense (internal functions are implemented as JUMP targets), forward propagation is intraprocedural. Cross-contract analysis requires separate taint summaries for each contract and propagation of taint through CALL boundaries, which is substantially more expensive and discussed in Section 7.3.
For eBPF/Solana, the program structure includes explicit function calls (BPF_CALL), requiring interprocedural analysis. Function summaries—mapping input register taint to output register taint—enable modular analysis without full inlining.
5. Backward Propagation
5.1 Motivation
Forward propagation computes all values reachable from taint sources. This answers “what can be influenced by attacker input?” but not “what influences this specific security-critical operation?” When the set of sinks is small relative to the number of source-derived values, backward (demand-driven) propagation is more efficient.
5.2 Algorithm
Backward propagation starts from sinks and traces data dependencies in reverse:
BACKWARD-TAINT(CFG, sinks):
for each block b in CFG:
demand[b] = ⊥
for each (block b, position p) in sinks:
demand[b] = demand[b] ⊔ {(b, p)}
worklist = {b : (b, _) ∈ sinks}
while worklist ≠ ∅:
b = worklist.remove()
in_demand = reverse_transfer(b, demand[b])
for each predecessor p of b:
merged = demand[p] ⊔ in_demand
if merged ≠ demand[p]:
demand[p] = merged
worklist.add(p)
return {v : v ∈ demand[entry] ∧ is_source(v)}
The reverse_transfer function inverts the forward transfer: for each instruction that contributes to a demanded output, it generates demand for the corresponding inputs. For ADD, if the output is demanded, both operands are demanded. For DUP$_n$, if the top-of-stack copy is demanded, the original at depth $n$ is demanded.
5.3 Tradeoffs: Forward vs. Backward
| Dimension | Forward | Backward |
|---|---|---|
| Starting point | Sources | Sinks |
| Computes | All tainted values | Provenance of specific values |
| Cost driver | Number of source-derived values | Number of sink-demanded values |
| False positives | Reports tainted values at sinks regardless of exploitability | Only reports if a source reaches the demanded sink |
| Interaction with control flow | Naturally follows execution order | Requires reverse CFG; loop handling is more complex |
In practice, a combined approach yields the best results: forward propagation establishes the taint map, then backward queries from specific sinks verify that the taint genuinely flows from a relevant source. This two-phase approach eliminates false positives where a value at a sink is tainted but the taint originates from a benign source.
6. eBPF Register Machine Considerations
While the EVM is a pure stack machine, Solana’s eBPF runtime uses a register-based architecture with 11 general-purpose 64-bit registers (r0-r10). This simplifies some aspects of taint tracking while complicating others.
Registers provide stable names—unlike stack positions, r3 always refers to r3. Taint does not shift when unrelated operations occur. However, eBPF introduces challenges absent from the EVM:
- Pointer arithmetic: eBPF permits bounded pointer arithmetic on register values pointing into account data or stack memory. A tainted offset added to a base pointer creates a tainted pointer, and any load through that pointer must be treated as tainted.
- Variable-length account data: Solana accounts can be up to 10 MB. Taint tracking over this space requires region-based abstraction rather than per-byte tracking.
- Instruction data deserialization: Unlike EVM’s
CALLDATALOADwhich loads at explicit 32-byte-aligned offsets, Solana instruction data is typically deserialized through custom code, requiring the taint analysis to trace through deserialization logic.
The transfer functions for eBPF arithmetic, logic, and branch instructions follow the same lattice framework as Section 3, substituting register names for stack positions.
7. Precision Challenges
7.1 Implicit Flows Through Control Dependence
An explicit flow occurs when tainted data directly affects a value through a data dependency. An implicit flow occurs when tainted data affects a value through a control dependency—the value is assigned in a branch whose condition depends on tainted data.
// Explicit flow (captured by standard taint propagation):
y = f(tainted_x)
// Implicit flow (missed without control dependence tracking):
if (tainted_x > 0):
y = 1
else:
y = 0
In EVM bytecode, this pattern manifests as a JUMPI whose condition is tainted, followed by different PUSH/SSTORE sequences on each branch. The values stored are constants (untainted by standard transfer functions), but they carry information about the tainted condition.
Tracking implicit flows requires computing the control dependence relation over the CFG and tainting all values assigned within a region that is control-dependent on a tainted branch. This is sound but dramatically increases false positives: in a typical EVM contract, most computation occurs within branches dependent on msg.sender (a tainted value), so nearly all values become tainted.
The practical compromise is to track only explicit flows for general-purpose vulnerability detection, and to enable implicit flow tracking selectively for specific vulnerability classes (e.g., information leakage through timing side channels) where it is essential.
7.2 Cryptographic Sinks: KECCAK256 and Storage Key Derivation
The KECCAK256 opcode computes the Keccak-256 hash of a memory region. In Solidity, mapping accesses compile to KECCAK256(key . slot) where . denotes concatenation and slot is the mapping’s base storage position. This means virtually every mapping read or write involves hashing.
The taint question is: should the hash output inherit the taint of its input?
Argument for propagation: The hash output depends on the input. If an attacker controls the mapping key (e.g., balances[msg.sender]), they influence which storage slot is accessed. An access control vulnerability might allow an attacker to read or write arbitrary slots by choosing the key.
Argument against propagation: Keccak-256 is a cryptographic hash. The output reveals no information about the input (under standard cryptographic assumptions). Propagating taint through the hash would taint every storage slot accessed via a mapping keyed by any tainted value—which is nearly all of them.
The resolution requires distinguishing between two roles of the hash:
- As a storage index: the hash serves as an address, and the taint of the key determines which slot is accessed. The slot address should inherit the key’s taint, but the value loaded from that slot should receive the slot’s own taint (from $\hat{T}$), not the key’s taint.
- As a data transformation: if the hash output is used as data (not merely as an address), taint should propagate.
flowchart LR
classDef system fill:#1a3333,stroke:#5ba8a8,stroke-width:2px,color:#c0e8e8
classDef process fill:#1a2233,stroke:#7ea8d4,stroke-width:2px,color:#c0d8f0
classDef data fill:#332a1a,stroke:#d4b870,stroke-width:2px,color:#f0e0c0
classDef highlight fill:#332519,stroke:#e8a87c,stroke-width:2px,color:#f0d8c0
classDef attack fill:#331a1a,stroke:#d97777,stroke-width:2px,color:#f0c0c0
KEY["Mapping Key\n(tainted: ℓ_calldata)"]:::attack
HASH["KECCAK256\n(index derivation)"]:::process
SLOT["Storage Slot Address\n(tainted: ℓ_calldata)"]:::data
SLOAD_OP["SLOAD"]:::process
VALUE["Loaded Value\n(taint from storage state)"]:::highlight
SINK["Sink: CALL value\nor SSTORE"]:::system
KEY --> HASH --> SLOT --> SLOAD_OP --> VALUE --> SINK
Formally, we define a taint-decoupling rule for hash-as-index:
$$\tau_{\text{KECCAK256-index}}(\sigma) = \sigma[\hat{S} \mapsto \text{push}(\hat{S}[0]_{\text{offset-taint}},\ \text{pop}(\text{pop}(\hat{S})))]$$
where $\text{offset-taint}$ is recorded as address taint that affects which slot is accessed but does not contaminate the value loaded from that slot. This requires a two-component taint label distinguishing address taint from value taint—an extension of the basic powerset lattice.
7.3 Aliasing Through Dynamic Dispatch
EVM DELEGATECALL executes another contract’s code in the context of the calling contract’s storage. From a taint perspective, this means the callee can read and write the caller’s storage, but the callee’s code is not available for analysis (it resides at a separate address). The taint effect of a DELEGATECALL must be conservatively approximated.
Similarly, proxy patterns use DELEGATECALL to forward all calls to an implementation contract whose address may itself be stored in a storage slot. If the implementation address is tainted (e.g., it can be changed by a privileged actor), then the behavior of every delegated call is effectively attacker-controlled.
For eBPF/Solana, CPI to a program whose ID is derived from tainted input creates the analogous problem. If the program ID is attacker-controlled, any CPI transfers control to arbitrary code.
The sound approximation is to treat the output of any call with a tainted target as fully tainted ($\top$). This is coarse but unavoidable without interprocedural analysis of the callee.
7.4 The Tainted-Index Problem Revisited
When a memory or storage index is itself tainted, the analysis cannot determine which concrete location is being read or written. The spectrum of approximations:
- Full havoc: taint all memory/storage locations ($\top$ for all positions). Sound but useless—everything becomes tainted.
- Points-to set: compute the set of possible concrete values for the index (using value-set analysis or interval abstract domain), and taint only those locations. Precise but expensive and incomplete for symbolic indices.
- Region-based: partition memory into named regions (e.g., “calldata buffer,” “return buffer,” “scratch space”) and taint at region granularity. A practical middle ground.
- Ignore: treat symbolic indices as concrete unknowns and skip them. Unsound—may miss genuine taint flows.
The choice among these approximations is the single largest determinant of practical taint precision in stack machine analysis.
8. Evaluation Framework
Measuring taint analysis quality requires a ground truth: for each potential vulnerability report, is it a true positive (real vulnerability), false positive (spurious report), or false negative (missed vulnerability)?
8.1 Metrics
- Precision = TP / (TP + FP): fraction of reported taint flows that are genuine.
- Recall = TP / (TP + FN): fraction of genuine taint flows that are reported.
- F1 = 2 * Precision * Recall / (Precision + Recall): harmonic mean.
8.2 Ground Truth Construction
For EVM contracts, ground truth can be constructed from:
- Known exploit transactions: real-world exploits provide definitive true positives with concrete data flow traces.
- Manually audited contracts: professional audit reports identify vulnerabilities that serve as TP/FN references.
- The SWC Registry (Smart Contract Weakness Classification): catalogued weakness patterns with sample contracts.
For Solana programs, the Neodyme and sec3 vulnerability disclosures provide analogous ground truth.
8.3 Over-Tainting and Under-Tainting Characterization
For any specific analysis configuration, the sources of false positives and false negatives can be categorized:
| Source of FP (over-tainting) | Source of FN (under-tainting) |
|---|---|
| Union at join points when branches are mutually exclusive | Missing taint source (e.g., not tainting ORIGIN) |
Taint propagation through x * 0 or similar absorbing operations | Taint dropped at KECCAK256 used as data, not index |
| Implicit flow tracking through control dependence | Missing implicit flow tracking |
| Conservative memory havoc on symbolic index | Ignoring tainted indices |
| Return data from benign external calls | Cross-contract flows not tracked |
This characterization is actionable: each source of imprecision maps to a specific design choice in the transfer functions, and the tradeoff can be tuned per vulnerability class.
9. Related Work
The foundations of taint analysis trace to Denning and Denning’s lattice model for secure information flow (1977). The powerset lattice formulation used here follows directly from their framework, adapted to the operational semantics of stack machines.
Schwartz, Avgerinos, and Brumley (2010) provide a systematic survey of dynamic taint analysis techniques (“All You Ever Wanted to Know About Dynamic Taint Analysis and Forward Symbolic Execution”), establishing the standard taxonomy of explicit vs. implicit flows and the over/under-tainting tradeoffs.
For EVM-specific analysis, Grech et al. (2018) introduce Madmax, which performs gas-focused vulnerability detection using Datalog-based analysis over decompiled EVM bytecode. Their approach uses an intermediate representation that abstracts away the stack, recovering variable-like entities before applying data flow analysis—effectively performing semantic lifting as a prerequisite for taint-like reasoning.
Brent et al. (2020) present Ethainter, which formulates information flow properties for Ethereum smart contracts in Datalog and computes composite taint vulnerabilities through the Gigahorse decompiler’s three-address code representation. Their work demonstrates that lifting bytecode to a register-based IR before applying taint analysis is a practical approach, though it introduces imprecision at the lifting boundary.
Feist, Grieco, and Buterin (2019) describe Slither, which performs taint and data dependency analysis at the Solidity AST level. While effective when source code is available, their approach does not apply to contracts where only bytecode is deployed.
Krupp and Rossow (2018) present teEther, which uses symbolic execution to generate exploit payloads for EVM contracts, implicitly computing taint as a byproduct of constraint collection. Their work illustrates the connection between taint analysis and symbolic execution—taint tracking can be viewed as a lightweight abstract interpretation that approximates the constraint sets symbolic execution would compute.
For eBPF, Gershuni et al. (2019) present a formal verification framework for eBPF programs using abstract interpretation, establishing the feasibility of static analysis for this instruction set. Their work on verifying safety properties of eBPF bytecode provides a foundation for taint-specific analysis in the Solana context.
The abstract interpretation framework underlying our fixpoint computation follows Cousot and Cousot (1977). The specific application of Galois connections to information flow analysis is developed in Giacobazzi and Mastroeni (2004).
10. References
-
P. Cousot and R. Cousot. “Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints.” POPL, 1977.
-
D. E. Denning and P. J. Denning. “Certification of Programs for Secure Information Flow.” Communications of the ACM, 20(7), 1977.
-
E. J. Schwartz, T. Avgerinos, and D. Brumley. “All You Ever Wanted to Know About Dynamic Taint Analysis and Forward Symbolic Execution (but Might Have Been Afraid to Ask).” IEEE S&P, 2010.
-
N. Grech, M. Kong, A. Scholz, and B. Smaragdakis. “MadMax: Surviving Out-of-Gas Conditions in Ethereum Smart Contracts.” OOPSLA, 2018.
-
L. Brent, N. Grech, S. Lagouvardos, B. Scholz, and Y. Smaragdakis. “Ethainter: A Smart Contract Security Analyzer for Composite Vulnerabilities.” PLDI, 2020.
-
J. Feist, G. Grieco, and A. Groce. “Slither: A Static Analysis Framework for Smart Contracts.” WETSEB, 2019.
-
J. Krupp and C. Rossow. “teEther: Gnawing at Ethereum to Automatically Exploit Smart Contracts.” USENIX Security, 2018.
-
E. Gershuni, N. Amit, A. Gurfinkel, N. Narodytska, J. A. Navas, N. Rinetzky, L. Ryzhyk, and M. Sagiv. “Simple and Precise Static Analysis of Untrusted Linux Kernel Extensions.” PLDI, 2019.
-
R. Giacobazzi and I. Mastroeni. “Abstract Non-Interference: Parameterizing Non-Interference by Abstract Interpretation.” POPL, 2004.
-
G. Wood. “Ethereum: A Secure Decentralised Generalised Transaction Ledger.” Ethereum Yellow Paper, 2014. https://ethereum.github.io/yellowpaper/
-
Solana Labs. “Solana Program Runtime.” https://docs.solana.com/developing/programming-model/runtime
-
SWC Registry. “Smart Contract Weakness Classification and Test Cases.” https://swcregistry.io/