Best Practices for Building AI Coding Agents
AI coding agents — tools like Cursor, Bolt, and Devin — are the fastest-growing category in AI. They edit files, run tests, fix bugs, and generate entire applications. But building one that actually works in production is harder than it looks.
Here are the practices that matter most, based on what breaks first.
1. Isolate every user
This is non-negotiable. Each user has their own codebase, their own git state, their own dependencies. If two users share a filesystem, you get race conditions, data leaks, and impossible-to-debug failures.
The right pattern is one sandboxed environment per user. Not a shared container with user directories — a separate process with its own filesystem, its own shell, its own state. If one user's agent crashes, others are unaffected.
// Each user gets their own isolated environment
const cell = await oncell.cells.create({
  customerId: userId,
  agent: codingAgentCode,
});
// user's files, shell, and state are completely isolated
2. Make the filesystem persistent
Ephemeral sandboxes are fine for one-shot code execution. But coding agents need to maintain state across sessions — the user's repo, installed dependencies, test results, agent memory.
If you use ephemeral containers, you're re-cloning repos, re-installing dependencies, and losing context every session. That's 30-60 seconds of cold start that destroys the user experience.
Use persistent storage — ideally local NVMe, not network-attached. The difference between 0ms file reads and 5ms file reads compounds when your agent reads 50 files per request.
3. Add code search, not just file listing
The LLM's context window is limited. You can't dump the entire codebase into the prompt. You need to find the relevant files — and "relevant" means semantic similarity, not just filename matching.
Index the codebase for search. When the user says "fix the authentication bug," your agent should find the auth-related files, not list every file in the repo.
// Search the codebase for relevant context
const files = ctx.search.query("authentication middleware");
// Returns ranked results: auth.ts, middleware.ts, login.ts
// Feed these into the LLM prompt
4. Stream progress, don't batch
Coding agents are slow — they read files, think, write code, run tests. If you wait for the entire operation to complete before showing anything, the user thinks it's frozen.
Stream every step. Show "reading auth.ts...", "writing fix...", "running tests..." as they happen. The user stays engaged and can interrupt if the agent goes off track.
ctx.stream({ step: "reading", file: "auth.ts" });
const content = ctx.store.read("src/auth.ts");
ctx.stream({ step: "generating", description: "fixing token validation" });
const fix = await callLLM(content, instruction);
ctx.stream({ step: "writing", file: "auth.ts" });
ctx.store.write("src/auth.ts", fix);
ctx.stream({ step: "testing" });
const result = ctx.shell("npm test");
5. Implement crash recovery
Your agent will crash. The LLM will time out. The shell command will hang. The network will blip. This is not an edge case: at scale, something fails on roughly one in ten requests.
Use a durable execution pattern. Record each successful step in a journal. When the agent restarts, replay the journal to recover state and continue from the last successful step.
Without this, a crash at step 8 of 10 means re-running 7 steps that already succeeded. With a journal, it means replaying cached results and continuing from step 8.
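A minimal journal sketch in plain JavaScript (the class and method names are illustrative, not a real API; in production the entries would be written to durable storage after every step):

```javascript
// Record each successful step; on restart, already-journaled steps
// return their cached result instead of re-running.
class Journal {
  constructor(entries = []) {
    this.entries = entries; // steps that already succeeded
    this.cursor = 0;
  }
  async step(name, fn) {
    if (this.cursor < this.entries.length) {
      // Replay: this step succeeded before the crash, use the cached result.
      return this.entries[this.cursor++].result;
    }
    const result = await fn(); // first time through: actually run the step
    this.entries.push({ name, result });
    this.cursor++;
    return result;
  }
}
```

On restart, rebuild the Journal from the persisted entries: the first seven steps replay from cache, and real work resumes at step eight.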
6. Limit shell execution
Your agent has shell access. It will try to run rm -rf / at some point — because the LLM suggested it or because the agent hallucinated. Set timeouts, resource limits, and use a sandboxed runtime like gVisor.
7. Keep agent code separate from user code
Store your agent logic in a different directory than the user's codebase. If the agent edits its own code by accident, you get an infinite loop of self-modification.
The pattern: agent code in /data/agent.js, user code in /work/. The agent reads and writes to /work/ but never touches its own code.
8. Auto-pause idle environments
Most users are active for a few minutes, then leave for hours or days. If you keep their environment running 24/7, your infrastructure bill explodes. Auto-pause after 15 minutes of inactivity. Resume in <1 second when they return.
The key is that resume must preserve all state — files, database, search index, agent memory. If resuming feels like a cold start, you've defeated the purpose.
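The activity-tracking side is a simple reset-on-touch timer. A sketch in plain JavaScript (`IdleTimer` and the `pause` callback are illustrative; in practice `pause` would snapshot the environment):

```javascript
// Pause an environment after a period of inactivity; any user activity
// resets the countdown.
class IdleTimer {
  constructor(idleMs, pause) {
    this.idleMs = idleMs;
    this.pause = pause;
    this.timer = null;
  }
  touch() {
    // Called on every request or keystroke: restart the countdown.
    clearTimeout(this.timer);
    this.timer = setTimeout(this.pause, this.idleMs);
    this.timer.unref?.(); // don't keep the process alive just for this
  }
  stop() {
    clearTimeout(this.timer);
  }
}
```

Call touch() on every user interaction; if nothing arrives for idleMs, pause fires once.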
Build with OnCell
OnCell handles isolation, persistence, search, streaming, crash recovery, and auto-pause. You write the agent logic.