April 16, 2026 · 5 min read

Best Practices for Building AI Data Analysis Agents

"Chat with your data" products are everywhere — upload a CSV, ask questions, get insights. But building one that's secure, fast, and handles real-world data is surprisingly hard.

1. Never run user code in a shared environment

Your AI agent generates Python code to analyze the user's data — pandas, SQL, matplotlib. That code runs on a server. If two users share the same process, one user's code can read another user's data. This is not a theoretical risk — it's a guaranteed data breach.

Each user needs their own sandboxed environment. Their data goes in, their code runs, results come out. No cross-user access possible — not through the filesystem, not through environment variables, not through process memory.
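At minimum, that means one process per run, with its own working directory and a scrubbed environment. Here's a minimal Python sketch of that baseline (process isolation alone is not a full sandbox — in production you'd pair this with kernel-level isolation):

```python
import subprocess
import sys
import tempfile

def run_user_code(code: str, timeout: int = 10) -> str:
    """Run one user's code in its own process with a private working
    directory and a scrubbed environment, so two users' runs never
    share a filesystem root, env vars, or process memory."""
    with tempfile.TemporaryDirectory() as workdir:
        result = subprocess.run(
            [sys.executable, "-c", code],
            cwd=workdir,        # private filesystem root for this run
            env={},             # no inherited secrets or credentials
            capture_output=True,
            text=True,
            timeout=timeout,    # runaway code gets killed
        )
    return result.stdout
```

Each call gets a fresh temp directory that is deleted afterward, so nothing one user writes is visible to the next run.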

2. Keep data local to the compute

The typical architecture is: data in S3, compute in Lambda or ECS, database in RDS. Every analysis query crosses the network multiple times — fetch the file from S3, run the code, write results to RDS. For interactive "ask a question, get an answer" experiences, that latency kills the product.

Co-locate data and compute. When the user uploads a CSV, store it on the same machine that runs the analysis. A local NVMe read is effectively free; a network fetch costs ~50ms. For a pandas workflow that touches the file 10 times, that's roughly 500ms saved per query.

// Data and compute on the same NVMe
ctx.store.write("data/sales.csv", uploadedCsv);

// 0ms file read — no network
const result = ctx.shell(`python -c "
import pandas as pd
df = pd.read_csv('/cell/work/data/sales.csv')
print(df.describe().to_json())
"`);

3. Persist analysis state across sessions

Users don't finish analysis in one sitting. They upload data, explore, leave, come back tomorrow. If you use ephemeral containers, they start over every time — re-uploading data, re-asking questions, losing context.

Persist everything: uploaded files, computed columns, cached aggregations, conversation history. When the user returns, they pick up exactly where they left off.
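A minimal sketch of what that looks like: serialize the session state to persistent storage keyed by user, and restore it on return. The `sessions` directory here is a hypothetical stand-in for a persistent volume.

```python
import json
from pathlib import Path

STATE_DIR = Path("sessions")  # hypothetical persistent volume mount

def save_session(user_id: str, state: dict) -> None:
    """Persist a user's analysis state so they can resume later."""
    STATE_DIR.mkdir(exist_ok=True)
    (STATE_DIR / f"{user_id}.json").write_text(json.dumps(state))

def load_session(user_id: str) -> dict:
    """Restore state on return; a first-time visitor gets a fresh session."""
    path = STATE_DIR / f"{user_id}.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"files": [], "history": [], "cached_aggregations": {}}
```

The exact schema matters less than the principle: everything the user built up — files, derived columns, conversation — survives the session.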

4. Cache expensive computations

Some analyses take 30 seconds — large aggregations, complex joins, ML model training. Don't re-run them every time the user asks a follow-up question. Cache the result in the database, keyed by the query hash.

const cacheKey = `analysis:${hash(query)}`;
const cached = ctx.db.get(cacheKey);
if (cached) return cached;

const result = await runAnalysis(query);
ctx.db.set(cacheKey, result);
return result;

5. Stream insights as they're discovered

Don't wait for the full analysis to complete. If your agent finds a trend in the first 2 seconds, show it immediately. Stream partial results — "Revenue is up 23%" — while the full analysis continues in the background.
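In Python, a generator is the natural shape for this: yield each insight the moment it's computed, cheapest first, and let the server forward each one to the client (e.g. over SSE or a WebSocket). A sketch, assuming rows with a `revenue` field:

```python
from typing import Iterator

def analyze_streaming(rows: list[dict]) -> Iterator[str]:
    """Yield insights as soon as each is computed, instead of
    returning everything at the end."""
    revenues = [r["revenue"] for r in rows]

    # Cheap insight first: overall trend from first vs. last value.
    change = (revenues[-1] - revenues[0]) / revenues[0] * 100
    yield f"Revenue changed {change:+.0f}% over the period"

    # Slower insight second: full aggregate over all rows.
    yield f"Average revenue: {sum(revenues) / len(revenues):.2f}"
```

The user sees "Revenue changed +23%" immediately while the heavier aggregations are still running.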

6. Validate generated code before execution

The LLM will sometimes generate code that deletes files, makes network requests, or runs infinite loops. Before executing AI-generated code, statically check it for dangerous patterns and enforce guardrails: restricted imports, no network access, execution timeouts.
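For the import-restriction check, parsing the generated code with Python's `ast` module is more robust than string matching. A sketch, with an illustrative denylist (a first line of defense, not a substitute for a sandbox):

```python
import ast

# Illustrative denylist -- tune to your threat model.
BLOCKED_IMPORTS = {"os", "sys", "subprocess", "socket", "shutil", "urllib"}

def validate_generated_code(code: str) -> list[str]:
    """Static checks on LLM-generated code before execution.
    Returns a list of violations; an empty list means the code passed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"syntax error: {e}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_IMPORTS:
                    problems.append(f"blocked import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_IMPORTS:
                problems.append(f"blocked import: {node.module}")
    return problems
```

This catches both `import subprocess` and `from subprocess import run`, and rejects code that doesn't parse at all.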

Better yet, run in a sandbox with resource limits. gVisor, Firecracker, or similar. The sandbox prevents escape even if the generated code is malicious.

7. Handle messy data gracefully

Real CSVs have missing values, inconsistent types, encoding issues, and duplicate headers. Your agent will crash on the first real dataset if you don't handle this.

Pre-process uploaded files: detect encoding, handle BOM, normalize column names, infer types. Show the user a preview before analysis starts. Let the agent ask clarifying questions — "Column 'revenue' has 15% missing values. Should I fill with 0, mean, or skip those rows?"
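The header cleanup alone can be sketched with the standard library — strip a BOM, snake_case the column names, and suffix duplicate headers. A real pipeline would also infer types and report missing-value rates per column:

```python
import csv
import io
import re

def normalize_columns(raw_bytes: bytes) -> tuple[list[str], list[dict]]:
    """Pre-process an uploaded CSV: strip a leading BOM, normalize
    column names to snake_case, and de-duplicate repeated headers."""
    text = raw_bytes.decode("utf-8-sig")  # utf-8-sig eats a leading BOM
    reader = csv.reader(io.StringIO(text))
    raw_header = next(reader)

    seen: dict[str, int] = {}
    columns = []
    for name in raw_header:
        col = re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")
        if col in seen:  # duplicate header -> numeric suffix
            seen[col] += 1
            col = f"{col}_{seen[col]}"
        else:
            seen[col] = 0
        columns.append(col)

    rows = [dict(zip(columns, r)) for r in reader]
    return columns, rows
```

So a file with a BOM and two `Revenue ($)` columns comes out as `revenue` and `revenue_1` — clean names the agent can reference in generated code.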

Build with OnCell

Per-user data isolation, sandboxed code execution, persistent storage, and streaming — all built in.