Sandboxed Code Execution for AI Agents: Security, Architecture, and Production Patterns
Code execution is the most powerful tool an AI agent can have. It is also the most dangerous. An agent that can run arbitrary code can read files it shouldn't, make network calls to exfiltrate data, consume unbounded compute, install persistent backdoors, or crash the host system.
The temptation with early agent prototypes is to run code directly on the host machine with `subprocess` or `exec`. This works fine until the agent generates `import os; os.system("rm -rf /")` or a dependency chain that triggers a network fetch to an attacker-controlled server. In demos, the code you feed the agent is carefully crafted and benign. In production, users provide the input. The attack surface is as wide as everything the model has been trained on.
Production-grade code execution for AI agents requires a security boundary between the agent's execution environment and everything else: the host filesystem, the network, the other users of the same system. The boundary is enforced by sandboxing. Isolate the code execution in an environment with minimal capabilities and controlled resource limits.
This article covers the full technical stack for production sandboxed code execution: the threat model for AI code execution, the sandbox technologies available (containers, gVisor, Firecracker microVMs, E2B), resource limits and output capture, secure file transfer patterns, and the architecture for integrating sandboxed execution into an agent's tool call pipeline.
The Threat Model: What Can Go Wrong With AI Code Execution
The threat model for AI code execution differs from traditional code execution because the code is generated by a model that users can prompt. The attack surface includes:
Direct attacks via code generation:
- File system access: read /etc/passwd, write to arbitrary paths, delete files
- Process spawning: launch background processes, install cron jobs
- Network access: exfiltrate data, download malware, callback to C2 servers
- Resource exhaustion: fork bombs, infinite loops, large memory allocations
Prompt injection attacks:
- User provides input that causes the agent to generate malicious code
- Data in the execution environment contains injected instructions
- LLM output itself is adversarial (jailbroken or misconfigured model)
Indirect attacks:
- Package imports that have side effects: `import malicious_package` runs setup.py, which can execute arbitrary shell commands
- Data exfiltration through output: code writes sensitive data to stdout and the agent forwards it to an attacker
- SSRF (Server-Side Request Forgery): code makes HTTP requests to internal network endpoints
The minimum security requirements for any production code execution environment:
- Filesystem isolation: The sandbox can only access a specific, controlled filesystem. No access to the host filesystem or other users' data.
- Network isolation: Code cannot make outbound network connections by default. If network access is needed, it is explicitly allowlisted.
- Process isolation: Code cannot spawn processes that outlive the sandbox execution.
- Resource limits: CPU, memory, file size, and execution time are all bounded.
- No privilege escalation: Code cannot become root, modify kernel parameters, or escape the sandbox.
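These requirements can be checked mechanically before any execution is allowed. The sketch below is illustrative (the `SandboxPolicy` fields and `policy_violations` helper are names invented for this example, not part of any SDK):

```python
from dataclasses import dataclass, field

@dataclass
class SandboxPolicy:
    """Minimal policy mirroring the five requirements above."""
    isolated_filesystem: bool = True    # controlled root, no host mounts
    network_enabled: bool = False       # outbound blocked by default
    allowed_domains: list = field(default_factory=list)
    timeout_seconds: int = 30           # hard wall-clock limit
    memory_mb: int = 512                # hard memory cap
    run_as_root: bool = False           # no privilege escalation

def policy_violations(p: SandboxPolicy) -> list[str]:
    """Return every minimum requirement the policy fails to meet."""
    problems = []
    if not p.isolated_filesystem:
        problems.append("filesystem isolation disabled")
    if p.network_enabled and not p.allowed_domains:
        problems.append("network enabled without a domain allowlist")
    if p.timeout_seconds <= 0 or p.timeout_seconds > 300:
        problems.append("timeout missing or unreasonably large")
    if p.memory_mb <= 0:
        problems.append("no memory limit")
    if p.run_as_root:
        problems.append("sandbox process runs as root")
    return problems
```

Refusing to launch when `policy_violations` is non-empty turns the checklist into a gate rather than documentation.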
Sandbox Isolation Levels: Containers vs gVisor vs MicroVMs
Three isolation technologies provide progressively stronger security boundaries:
Container isolation (Docker, Podman):
- Isolation: Linux namespaces (pid, network, mount, user)
- Security level: Moderate (container processes share the host kernel)
- Attack surface: Kernel exploits can escape container isolation
- Performance: Near-native (~1% overhead for CPU-bound tasks)
- Startup time: 100-500ms
gVisor (Google's container sandbox):
- Isolation: User-space kernel (Sentry) intercepts all system calls
- Security level: High (syscalls are validated before reaching the host kernel)
- Attack surface: Limited to gVisor's syscall implementation (about 300 syscalls vs 400+ in Linux)
- Performance: 10-30% overhead for syscall-heavy workloads, near-native for CPU-bound
- Startup time: 200-800ms
Firecracker microVMs:
- Isolation: Hardware virtualization (KVM) with separate VM kernel
- Security level: Highest (true VM isolation with minimal attack surface: 5 device types)
- Attack surface: Limited to hypervisor interface (intentionally minimal)
- Performance: Near-native for CPU computation, about 5% memory overhead
- Startup time: 100-300ms (Firecracker's fast VM boot is a core feature)
Choosing isolation level by use case:
| Use case | Recommended isolation | Rationale |
|---|---|---|
| Internal tooling, trusted users | Container with resource limits | Speed, simplicity |
| SaaS product, user-provided inputs | gVisor or microVM | Kernel exploit risk from user inputs |
| Multi-tenant, adversarial inputs | Firecracker microVM | Maximum isolation for hostile code |
| Data science notebooks | gVisor | Balance security and performance |
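The decision table above can be encoded as a small selection helper. A sketch with illustrative names, erring toward stronger isolation when signals conflict:

```python
def recommend_isolation(trusted_users: bool,
                        multi_tenant: bool,
                        user_inputs: bool) -> str:
    """Map the use-case table to an isolation level.

    Illustrative helper: the return strings mirror the table's
    recommendations, and ambiguity resolves to stronger isolation.
    """
    if multi_tenant:
        return "firecracker-microvm"  # adversarial, hostile code assumed
    if user_inputs or not trusted_users:
        return "gvisor"               # kernel-exploit risk from user inputs
    return "container"                # internal tooling: speed, simplicity
```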
```python
from dataclasses import dataclass
from enum import Enum

class IsolationLevel(Enum):
    CONTAINER = "container"  # Docker namespaces
    GVISOR = "gvisor"        # gVisor user-space kernel
    MICROVM = "microvm"      # Firecracker/Cloud Hypervisor

@dataclass
class SandboxConfig:
    isolation: IsolationLevel = IsolationLevel.GVISOR
    cpu_cores: float = 0.5                 # Fractional CPU allocation
    memory_mb: int = 512                   # Memory limit
    disk_mb: int = 1024                    # Filesystem size limit
    timeout_seconds: int = 30              # Hard execution timeout
    network_enabled: bool = False          # Network access (off by default)
    allowed_domains: list[str] | None = None  # If network enabled, allowlist
    max_file_size_mb: int = 50             # Maximum output file size
    max_processes: int = 10                # Maximum concurrent processes
```
E2B: Managed Sandboxes for AI Agents
E2B (e2b.dev) is a cloud service that provides on-demand Linux sandboxes designed for AI agent code execution. It abstracts away the VM and container management and exposes a clean Python SDK.
```python
from e2b_code_interpreter import CodeInterpreter
import asyncio

async def run_code_with_e2b(
    code: str,
    language: str = "python",
    timeout_seconds: int = 30,
    files: dict[str, bytes] = None,
) -> dict:
    """
    Execute code in an E2B sandbox.
    Each sandbox is an isolated microVM (Firecracker-based).
    """
    async with CodeInterpreter() as sandbox:
        # Upload any context files before execution
        if files:
            for filename, content in files.items():
                await sandbox.files.write(f"/home/user/{filename}", content)
        # Execute code
        execution = await sandbox.notebook.exec_cell(
            code,
            timeout=timeout_seconds,
        )
        return {
            "stdout": execution.text,
            "stderr": str(execution.error) if execution.error else "",
            "outputs": [
                {"type": output.type, "data": output.data}
                for output in execution.results
            ],
            "error": bool(execution.error),
        }
```
```python
async def run_multi_step_analysis(
    steps: list[str],
    shared_files: dict[str, bytes] = None,
) -> list[dict]:
    """
    Run multiple code steps in the same sandbox (state persists between steps).
    Key E2B capability: stateful execution across multiple agent steps.
    """
    results = []
    async with CodeInterpreter() as sandbox:
        # Upload any shared files to the sandbox
        if shared_files:
            for filename, content in shared_files.items():
                await sandbox.files.write(f"/home/user/{filename}", content)
        # Execute each step, maintaining state
        for step_code in steps:
            execution = await sandbox.notebook.exec_cell(step_code, timeout=60)
            results.append({
                "stdout": execution.text,
                "error": bool(execution.error),
                "outputs": [{"type": o.type, "data": str(o.data)[:1000]}
                            for o in execution.results],
            })
            # Stop if there's an error
            if execution.error:
                break
    return results
```
```python
# Real-world agent tool implementation
async def agent_code_execution_tool(
    code: str,
    context_files: dict[str, str] = None,
) -> dict:
    """
    Tool function for AI agent code execution via E2B.
    context_files: {filename: content} dict of files to pre-load
    """
    try:
        file_bytes = {k: v.encode() for k, v in (context_files or {}).items()}
        result = await run_code_with_e2b(
            code,
            timeout_seconds=30,
            files=file_bytes,
        )
        if result["error"]:
            return {
                "status": "error",
                "message": f"Code execution failed: {result['stderr']}",
                "stdout": result["stdout"],
            }
        return {
            "status": "success",
            "stdout": result["stdout"][:5000],  # Cap output size
            "outputs": result["outputs"][:10],  # Cap number of outputs
        }
    except TimeoutError:
        return {"status": "timeout", "message": "Code execution exceeded 30 second limit"}
    except Exception as e:
        return {"status": "error", "message": f"Sandbox error: {str(e)}"}
```
E2B architecture: Each sandbox is a Firecracker microVM started from a pre-built snapshot. The snapshot approach reduces cold start time to 100-300ms: the VM is not booted from scratch but resumed from a frozen state. This makes E2B practical for interactive agent loops where code execution tools are called frequently.
E2B security properties:
- Each sandbox is an isolated Firecracker microVM (no shared kernel with host or other sandboxes)
- Network access disabled by default (explicitly enabled with domain allowlisting)
- Filesystem is ephemeral (destroyed when the sandbox closes)
- CPU and memory limits enforced at hypervisor level
- Maximum sandbox lifetime: configurable, default 30 minutes
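Whatever the backend, the hard wall-clock timeout belongs in the caller as well as in the sandbox itself, so a hung VM cannot hang the agent loop. A minimal asyncio sketch (helper names are illustrative, not part of the E2B SDK):

```python
import asyncio

async def run_with_hard_timeout(make_coro, timeout_seconds: float) -> dict:
    """Wrap any sandbox call in a caller-side wall-clock timeout.

    make_coro is a zero-argument callable returning the coroutine,
    so each attempt gets a fresh coroutine object.
    """
    try:
        value = await asyncio.wait_for(make_coro(), timeout=timeout_seconds)
        return {"status": "ok", "result": value}
    except asyncio.TimeoutError:
        return {"status": "timeout",
                "message": f"exceeded {timeout_seconds}s wall-clock limit"}

async def _slow_sandbox_call():
    # Stand-in for a sandbox execution that hangs
    await asyncio.sleep(10)
    return "unreachable"

result = asyncio.run(run_with_hard_timeout(_slow_sandbox_call, 0.05))
# result["status"] == "timeout"
```

The double layer matters: if the sandbox-side timeout fails (or the sandbox process wedges), the agent still gets control back.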
DIY Sandboxing: Implementing Your Own Secure Executor
For teams that cannot use managed services, a DIY sandbox can be built with Docker and seccomp profiles. This provides container-level isolation (not microVM-level) but is significantly more secure than direct subprocess execution.
```python
import docker
import tempfile
import os
import json
import asyncio

class DockerSandboxExecutor:
    """
    Execute code in an isolated Docker container.
    Uses a seccomp profile and read-only mounts for hardening.
    """
    # Seccomp profile: block dangerous syscalls
    SECCOMP_PROFILE = {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "names": [
                    "mount", "umount2", "ptrace",
                    "swapon", "swapoff", "reboot", "kexec_load",
                    "init_module", "delete_module", "create_module",
                ],
                "action": "SCMP_ACT_ERRNO",
            }
        ],
    }
    # Note: fork bombs are prevented with a pids limit (below) rather than
    # blocking clone(), which would also break Python threading.

    def __init__(self, base_image: str = "python:3.11-slim"):
        self.client = docker.from_env()
        self.base_image = base_image

    async def execute(self,
                      code: str,
                      timeout_seconds: int = 30,
                      memory_mb: int = 512,
                      cpu_quota: int = 50000,  # 50% of one CPU
                      network_enabled: bool = False,
                      environment: dict = None) -> dict:
        """Execute code in an isolated container."""
        # Write code to a temp file for injection
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py',
                                         delete=False) as f:
            f.write(code)
            code_file = f.name
        try:
            # Run container with restrictions
            container = await asyncio.to_thread(
                self.client.containers.run,
                self.base_image,
                command=["python", "/sandbox/code.py"],
                volumes={code_file: {"bind": "/sandbox/code.py", "mode": "ro"}},
                mem_limit=f"{memory_mb}m",
                memswap_limit=f"{memory_mb}m",  # No swap
                cpu_quota=cpu_quota,
                cpu_period=100000,
                pids_limit=10,  # Fork bomb prevention
                network_mode="none" if not network_enabled else "bridge",
                read_only=True,
                tmpfs={"/tmp": "size=64m,noexec"},  # Small writable /tmp, noexec
                security_opt=[
                    "no-new-privileges",
                    f"seccomp={json.dumps(self.SECCOMP_PROFILE)}",
                ],
                user="nobody",  # Run as non-root
                environment=environment or {},
                remove=False,  # Keep for log retrieval
                detach=True,
            )
            # Wait for completion or timeout
            try:
                result = await asyncio.wait_for(
                    asyncio.to_thread(container.wait),
                    timeout=timeout_seconds,
                )
                exit_code = result["StatusCode"]
                stdout = container.logs(stdout=True, stderr=False).decode()
                stderr = container.logs(stdout=False, stderr=True).decode()
                timed_out = False
            except asyncio.TimeoutError:
                container.kill()
                stdout = container.logs(stdout=True, stderr=False).decode()
                stderr = "Execution timed out"
                exit_code = -1
                timed_out = True
            finally:
                container.remove(force=True)
            return {
                "stdout": stdout[:10000],  # Cap at 10KB
                "stderr": stderr[:2000],
                "exit_code": exit_code,
                "timed_out": timed_out,
                "success": exit_code == 0 and not timed_out,
            }
        finally:
            os.unlink(code_file)
```
```python
# Stricter: block all network access at the OS level using netns
def create_network_isolated_sandbox():
    """
    Create a sandbox with network namespace isolation.
    The process gets a new network namespace with only loopback.
    """
    import subprocess
    import textwrap

    def run_in_isolated_network(code: str) -> dict:
        # Dedent so the embedded script has no leading indentation,
        # which would be a syntax error under `python3 -c`
        script = textwrap.dedent(f"""
            import signal
            signal.alarm(30)  # 30 second timeout via SIGALRM
            try:
                exec(compile({code!r}, '<agent_code>', 'exec'))
            except SystemExit:
                pass
        """)
        result = subprocess.run(
            ["unshare", "--net", "--user",  # New network and user namespace
             "--map-root-user",             # Map current user to root in namespace
             "python3", "-c", script],
            capture_output=True,
            text=True,
            timeout=35,
            env={},  # Empty environment
        )
        return {
            "stdout": result.stdout[:10000],
            "stderr": result.stderr[:2000],
            "exit_code": result.returncode,
            "success": result.returncode == 0,
        }

    return run_in_isolated_network
```
Resource Limits: CPU, Memory, Network, and Time
Every production sandbox needs explicit, enforced limits on all resource dimensions:
```python
@dataclass
class ResourceLimits:
    """Resource limits for sandbox execution."""
    # Compute
    cpu_cores: float = 1.0
    cpu_time_seconds: int = 30         # CPU-time budget (RLIMIT_CPU)
    cpu_quota_percent: int = 50        # Max % of CPU time
    # Memory
    ram_mb: int = 512
    swap_mb: int = 0                   # No swap by default
    # I/O
    disk_read_mb_per_s: int = 50
    disk_write_mb_per_s: int = 20
    max_output_bytes: int = 1_000_000  # 1MB output cap
    # Process
    max_processes: int = 10
    max_file_descriptors: int = 50
    # Network
    network_enabled: bool = False
    max_outbound_connections: int = 0  # 0 = none allowed
    allowed_domains: list[str] | None = None
```
```python
def enforce_python_resource_limits():
    """
    Apply resource limits using Python's resource module.
    Call this at the start of sandboxed Python execution.
    """
    import resource
    # CPU time limit (raises SIGXCPU after limit)
    resource.setrlimit(resource.RLIMIT_CPU, (30, 30))
    # Memory limit
    memory_bytes = 512 * 1024 * 1024  # 512 MB
    resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))
    # File size limit
    max_file_bytes = 50 * 1024 * 1024  # 50 MB
    resource.setrlimit(resource.RLIMIT_FSIZE, (max_file_bytes, max_file_bytes))
    # Process limit (fork bomb prevention)
    resource.setrlimit(resource.RLIMIT_NPROC, (10, 10))
    # File descriptor limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (50, 50))
```
```python
def validate_code_before_execution(code: str) -> list[str]:
    """
    Static analysis to flag potentially dangerous code.
    Not a security substitute for sandboxing (a warning layer only).
    """
    import ast
    warnings = []
    dangerous_patterns = [
        "__import__", "importlib", "subprocess", "os.system",
        "eval(", "exec(", "compile(",
        "open(",
        "socket", "urllib", "requests", "httpx",
        "ctypes", "cffi",
        "sys.exit", "os.kill", "signal.raise_signal",
    ]
    for pattern in dangerous_patterns:
        if pattern in code:
            warnings.append(f"Potentially dangerous: '{pattern}' found in code")
    # Try AST parse: catch syntax errors early
    try:
        ast.parse(code)
    except SyntaxError as e:
        warnings.append(f"Syntax error: {e}")
    return warnings
```
Output size limits: Uncontrolled output can cause memory exhaustion in the agent's context window. A loop that prints 100MB of data causes multiple downstream failures. Cap stdout/stderr at a reasonable limit (1-10MB) and truncate with a message if exceeded.
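A truncation helper makes the cap explicit. A minimal sketch (the helper name and default limit are illustrative):

```python
def truncate_output(text: str, max_bytes: int = 1_000_000) -> str:
    """Cap captured output, appending an explicit truncation notice.

    Truncates on a UTF-8 byte budget so multi-byte characters
    cannot push the result over the limit.
    """
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # errors="ignore" drops a partial multi-byte sequence at the cut point
    clipped = data[:max_bytes].decode("utf-8", errors="ignore")
    return clipped + f"\n... [truncated at {max_bytes} bytes]"
```

The explicit notice matters: an agent that sees a silent cut may reason from incomplete output, while a labeled truncation tells it to narrow the query.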
Output Capture: Stdout, Stderr, Files, and Return Values
```python
import io
import traceback
from contextlib import redirect_stdout, redirect_stderr
import ast

def execute_python_safely(code: str,
                          max_output_bytes: int = 100_000,
                          global_vars: dict = None,
                          local_vars: dict = None) -> dict:
    """
    Execute Python code with full output capture.
    NOT a security sandbox (use within a sandboxed environment).
    Captures stdout, stderr, return value, and created variables.
    """
    stdout_buffer = io.StringIO()
    stderr_buffer = io.StringIO()
    # One shared namespace for globals and locals: with separate dicts,
    # functions defined in the cell could not see other cell-level names
    namespace = dict(global_vars or {})
    namespace.update(local_vars or {})
    baseline_keys = set(namespace)
    error = None
    return_value = None
    try:
        with redirect_stdout(stdout_buffer), redirect_stderr(stderr_buffer):
            # Try to get return value from last expression
            tree = ast.parse(code, mode='exec')
            if tree.body and isinstance(tree.body[-1], ast.Expr):
                # Split: compile all but last expression, then eval last
                last_expr = ast.Expression(body=tree.body[-1].value)
                preceding = ast.Module(body=tree.body[:-1], type_ignores=[])
                ast.fix_missing_locations(last_expr)
                ast.fix_missing_locations(preceding)
                exec(compile(preceding, '<sandbox>', 'exec'), namespace)
                return_value = eval(compile(last_expr, '<sandbox>', 'eval'), namespace)
            else:
                exec(compile(tree, '<sandbox>', 'exec'), namespace)
    except Exception as e:
        error = {
            "type": type(e).__name__,
            "message": str(e),
            "traceback": traceback.format_exc(),
        }
    stdout = stdout_buffer.getvalue()
    stderr = stderr_buffer.getvalue()
    # Cap output sizes
    if len(stdout) > max_output_bytes:
        stdout = stdout[:max_output_bytes] + f"\n... [truncated at {max_output_bytes} bytes]"
    if len(stderr) > max_output_bytes:
        stderr = stderr[:max_output_bytes] + f"\n... [truncated at {max_output_bytes} bytes]"
    # Collect any new variables defined in the execution
    new_vars = {
        k: v for k, v in namespace.items()
        if k not in baseline_keys
        and not k.startswith("_")
        and not callable(v)
    }
    return {
        "stdout": stdout,
        "stderr": stderr,
        "return_value": return_value,
        "error": error,
        "success": error is None,
        "new_variables": {k: repr(v)[:200] for k, v in new_vars.items()},
    }
```
Multi-Turn Code Execution: Maintaining State Across Steps
The most valuable aspect of sandboxed execution for agents is stateful multi-turn execution: variables defined in step 1 are available in step 2. This enables agents to incrementally build analysis, load data once and manipulate it across multiple steps, and debug by inspecting intermediate state.
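At its core, statefulness is just a shared namespace reused across `exec` calls. A minimal in-process illustration of the mechanism (this is NOT a security boundary; real sandboxes keep the namespace inside the isolated VM):

```python
# One shared namespace, reused for every cell: variables defined in
# step 1 are visible in step 2, exactly like notebook cells.
shared_ns: dict = {}

def run_cell(code: str) -> None:
    """Execute one cell against the persistent namespace."""
    exec(compile(code, "<cell>", "exec"), shared_ns)

run_cell("data = [1, 2, 3]")   # step 1 defines a variable
run_cell("total = sum(data)")  # step 2 sees it
# shared_ns["total"] == 6
```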
```python
class StatefulSandbox:
    """
    Manages a persistent execution environment across multiple code cells.
    Works with E2B or any persistent sandbox implementation.
    """
    def __init__(self, sandbox_factory):
        self._factory = sandbox_factory
        self.sandbox = sandbox_factory()
        self.execution_history: list[dict] = []
        self.defined_variables: dict[str, str] = {}  # name → repr

    async def execute_cell(self, code: str,
                           timeout_seconds: int = 30) -> dict:
        """Execute a code cell in the persistent environment."""
        result = await self.sandbox.execute(code, timeout_seconds)
        self.execution_history.append({
            "code": code,
            "result": result,
            "step": len(self.execution_history) + 1,
        })
        # Track new variables from this execution
        if result.get("new_variables"):
            self.defined_variables.update(result["new_variables"])
        return result

    def get_context_summary(self) -> str:
        """
        Generate a summary of the current execution context for the agent.
        Shows the agent what variables exist and what has been done.
        """
        lines = []
        if self.defined_variables:
            lines.append("Variables in scope:")
            for name, repr_val in list(self.defined_variables.items())[:20]:
                lines.append(f"  {name} = {repr_val[:100]}")
        if self.execution_history:
            lines.append(f"\nExecution history: {len(self.execution_history)} cells executed")
            last_result = self.execution_history[-1]["result"]
            if last_result.get("error"):
                lines.append(f"Last execution: ERROR - {last_result['error']['message']}")
            else:
                lines.append("Last execution: SUCCESS")
        return "\n".join(lines) if lines else "No code executed yet."

    async def reset(self):
        """Start a fresh execution environment."""
        await self.sandbox.close()
        # Recreate via the stored factory (not __class__, which would
        # bypass the factory's configuration)
        self.sandbox = self._factory()
        self.execution_history = []
        self.defined_variables = {}
```
Security Checklist: What to Audit Before Deploying
SANDBOX SECURITY AUDIT CHECKLIST
Isolation:
☐ Code runs in a separate process/container/VM from the host
☐ Filesystem access is restricted to a controlled directory
☐ Container/VM cannot access host filesystem via bind mounts
☐ Network access is disabled by default
☐ Process cannot spawn privileged children
Resource limits:
☐ Hard timeout enforced (not just soft signal)
☐ Memory limit enforced at OS/hypervisor level
☐ CPU limit enforced (prevents 100% CPU consumption)
☐ Disk write limit enforced (prevents disk exhaustion)
☐ Output size limit enforced (prevents context window flooding)
Input validation:
☐ Code size limit enforced (before execution)
☐ Static analysis warnings logged (not blocking, but informational)
☐ Code encoding validated (prevent injection via encoding tricks)
Output handling:
☐ stdout/stderr captured and size-limited before return
☐ File outputs validated before transfer to agent
☐ No secrets or host credentials accessible to sandbox
Multi-tenancy (if applicable):
☐ Each user/session gets a separate sandbox instance
☐ No shared filesystem state between user sandboxes
☐ Sandbox cleanup verified after session ends
Monitoring:
☐ Execution time logged for anomaly detection
☐ Memory usage logged
☐ Error patterns monitored for attack signatures
☐ Resource limit violations alerted on
Key Takeaways
- The minimum security requirements for AI code execution are: filesystem isolation (sandbox can only access a controlled directory), network isolation (outbound connections blocked by default), process isolation (no persistent processes after execution), resource limits (CPU, memory, disk, time), and no privilege escalation. Running AI-generated code without these controls is not a prototype shortcut. It is an active security incident waiting to happen.
- Three isolation technologies provide progressively stronger security: containers (Linux namespaces, shared kernel, fastest), gVisor (user-space kernel intercepts syscalls, blocks kernel exploits), and Firecracker microVMs (hardware virtualization, full VM isolation, maximum security). For multi-tenant or user-input-driven code execution, use gVisor or Firecracker.
- E2B provides managed Firecracker microVMs with cold start times of 100-300ms (via snapshot resumption), making it practical for interactive agent loops. The SDK handles the VM lifecycle, file transfer, output capture, and cleanup automatically. E2B is the fastest path to production-safe agent code execution.
- Stateful multi-turn execution is the key capability that makes sandboxed code useful for agents: variables from step 1 are available in step 2. Use persistent sandboxes (or E2B's persistent CodeInterpreter context) for multi-step agent tasks. Track defined variables and pass context summaries to the agent at each step.
- Static code analysis (pattern matching, AST analysis) is a useful warning layer but is not a security control. Determined attackers can bypass all pattern-based filters. The security guarantee comes from the sandbox isolation, not from rejecting dangerous-looking code patterns. Run static analysis for logging and alerting, not for access control.
- Output size limits are critical and frequently overlooked. A print loop that generates 1GB of output causes context window flooding, agent confusion, and potential memory exhaustion in the orchestration layer. Cap stdout/stderr at 100KB-1MB, truncate with a clear message, and add per-character limits on file outputs transferred back to the agent.
FAQ
How do you safely run AI-generated code in production?
Safely running AI-generated code in production requires isolating the execution in a sandbox with enforced security boundaries. The minimum requirements are: filesystem isolation (the code can only access a designated directory, not the host filesystem), network isolation (outbound connections blocked by default), process isolation (no persistent processes after execution), and resource limits (CPU time, memory, disk writes, execution timeout). For multi-tenant or user-facing deployments, use gVisor or Firecracker microVM isolation rather than standard Docker containers. Docker's shared kernel creates a risk of kernel-level escape exploits. Managed services like E2B (Firecracker-based) provide all these guarantees with a simple SDK and 100-300ms cold start times.
What is E2B and how is it used for AI agents?
E2B is a cloud service that provides on-demand isolated Linux sandboxes (Firecracker microVMs) designed for AI agent code execution. The Python SDK provides a CodeInterpreter context that handles VM lifecycle, code execution, output capture, and file transfer. Each sandbox is an isolated VM that cannot access the host system or other sandboxes. E2B sandboxes start quickly (100-300ms via snapshot resumption) making them practical for interactive agent loops. Key features: stateful execution (variables persist across multiple code cells within a session), file upload/download between the agent and sandbox, support for Python and Node.js, and configurable environment (custom packages, system dependencies). E2B is the fastest path to production-safe agent code execution for teams that don't want to build sandbox infrastructure themselves.
What is the difference between gVisor and standard Docker for AI code execution?
Standard Docker uses Linux namespaces for isolation but allows processes inside the container to make direct syscalls to the host kernel. If AI-generated code exploits a kernel vulnerability (or a container escape technique), it can access the host system. gVisor adds a user-space kernel layer (called Sentry) that intercepts all syscalls from the container before they reach the host kernel. The syscalls are implemented in Go/Rust in user space, validated, and then translated to a minimal set of host syscalls. This eliminates most kernel exploit vectors because the container's code never directly touches the host kernel. The tradeoff is 10-30% performance overhead for syscall-heavy workloads (network I/O, file I/O), generally acceptable for the security guarantee provided. For AI code execution with user-provided inputs, gVisor is strongly preferred over standard Docker.
AI agents with code execution capabilities are more powerful than agents without them. An agent that can run code can process arbitrary data, test its own hypotheses, automate tasks that would otherwise require brittle string manipulation, and verify its outputs against hard facts. The capability is real and significant.
The security debt from treating code execution as another tool call is also real. The history of security incidents in adjacent domains (server-side template injection, eval() misuse, arbitrary file read in web apps) follows the same pattern: someone found a way to get user input into an execution context that didn't have appropriate boundaries. AI agents make this attack surface larger because the path from user input to code execution is shorter.
The practical answer is the same as in every security context: assume the code will try to do things you don't want, and make it technically impossible rather than unlikely. The sandbox technologies are mature, the managed services are fast and cheap, and the implementation patterns are established. There is no reason to deploy AI code execution without appropriate isolation.
Build the sandbox first. Add the capability second. The order matters.
Written & published by Chaitanya Prabuddha