Best Practices for Building AI Automation Agents
AI automation agents run multi-step workflows — scrape data, process it, call APIs, generate reports. They fail more often than chatbots because they have more moving parts. Here's how to build them reliably.
1. Use durable execution
The most important pattern for automation agents. Every successful step gets recorded in a journal. When the agent crashes — and it will — it replays the journal to recover cached results and continues from the last successful step.
Without durable execution, a crash at step 8 of 10 means re-running all 10 steps. With it, steps 1-7 replay instantly from cache, and only step 8 re-executes.
// Each step is cached — crash at any point, resume from last success
const data = await ctx.journal.durable("scrape", () =>
scrapeWebsite(url)
);
const processed = await ctx.journal.durable("process", () =>
processData(data)
);
const report = await ctx.journal.durable("generate-report", () =>
generateReport(processed)
);
await ctx.journal.durable("send-email", () =>
sendEmail(report, recipient)
);

2. Isolate per customer
Each customer has different data sources, different configurations, different schedules. If customer A's automation fails, it shouldn't affect customer B. Run each customer's workflow in its own isolated environment.
This also simplifies debugging. When something breaks, you know exactly which customer's environment to look at. The journal shows every step that ran, what succeeded, and where it failed.
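True isolation means separate environments, but the containment idea can be sketched even inside a single orchestrator process. This is an illustrative sketch, not OnCell's API: `runWorkflow` is an assumed function, and `Promise.allSettled` keeps one customer's crash from aborting the others.

```typescript
// Hypothetical per-customer runner; runWorkflow is an assumed workflow entry point.
type RunResult = { customer: string; ok: boolean; error?: string };

async function runAllCustomers(
  customers: string[],
  runWorkflow: (customer: string) => Promise<void>
): Promise<RunResult[]> {
  // allSettled never short-circuits: customer A failing does not stop customer B
  const settled = await Promise.allSettled(customers.map((c) => runWorkflow(c)));
  return settled.map((r, i) => ({
    customer: customers[i],
    ok: r.status === "fulfilled",
    error: r.status === "rejected" ? String(r.reason) : undefined,
  }));
}
```

The per-customer result objects are also what you would log for debugging: each entry tells you exactly which customer's run failed and why.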
3. Store intermediate results
Don't pipeline everything in memory. Write intermediate results to storage after each major step. If the process restarts, you don't need to re-scrape 10,000 pages — the scraped data is on disk.
// Store intermediate results on persistent storage
ctx.store.write("scraped/page-1.json", JSON.stringify(result));
// Next step reads from storage, not from memory
const pages = ctx.store.list("scraped/");
for (const page of pages) {
const data = JSON.parse(ctx.store.read(`scraped/${page}`));
// process...
}

4. Set timeouts on everything
LLM calls time out. HTTP requests hang. Shell commands run forever. Every external call needs a timeout, and every shell command needs a maximum execution time. Without timeouts, one hung request blocks the entire workflow indefinitely.
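One way to enforce this in plain TypeScript is a small wrapper that races any promise against a deadline. This is a minimal sketch (the helper name and signature are illustrative, not part of any SDK):

```typescript
// Wrap any promise with a deadline; rejects with a labeled error on timeout.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer so the process can exit
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Usage would look like `await withTimeout(scrapeWebsite(url), 30_000, "scrape")`, so a hung scrape fails fast and can be retried instead of stalling the whole run.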
5. Make workflows idempotent
Your automation will re-run steps due to crashes, retries, or duplicate triggers. Every step must be safe to run twice. Sending an email twice is bad. Writing a file twice is fine. Design accordingly.
Use the journal for deduplication. Before running a step, check if it already has a cached result. If it does, skip it.
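The check-then-skip logic can be sketched with an in-memory map. This is an assumption about how such a journal works, not OnCell's actual implementation, which would persist results rather than hold them in memory:

```typescript
// Minimal in-memory journal sketch: a real journal would persist to disk or a DB.
const journal = new Map<string, unknown>();

async function durable<T>(key: string, fn: () => Promise<T>): Promise<T> {
  // Replay path: a cached result means the step already succeeded, so skip it
  if (journal.has(key)) return journal.get(key) as T;
  const result = await fn();
  journal.set(key, result); // record only after the step succeeds
  return result;
}
```

This is what makes non-idempotent steps like "send-email" safe to re-run: on replay, the cached result is returned and the email is not sent a second time.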
6. Stream progress to a dashboard
Automation agents run in the background. Your users need visibility: which step is running, how long it has been running, and whether it failed. Stream status updates to a dashboard in real time.
ctx.stream({ step: 1, total: 5, status: "scraping", url });
// ... work ...
ctx.stream({ step: 2, total: 5, status: "processing", records: 1500 });
// ... work ...
ctx.stream({ step: 3, total: 5, status: "generating report" });

7. Schedule with pause/resume, not cron
Traditional cron jobs spin up a new process every time. With AI agents, you want to resume the same environment — same files, same database, same state. Pause the agent between runs, resume when it's time to work.
Paused environments cost almost nothing ($0.003/hr). Active environments cost normal compute rates. You only pay when the agent is actually doing work.
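To make the cost claim concrete, here is a worked calculation. The paused rate ($0.003/hr) comes from the article; the active rate is a purely illustrative parameter, since the article does not state one:

```typescript
// Paused rate is from the article; activeRate is an illustrative assumption.
const PAUSED_RATE = 0.003; // $/hr while paused

function monthlyCost(activeHrsPerDay: number, activeRate: number, days = 30): number {
  const active = activeHrsPerDay * days * activeRate; // pay compute rates only while working
  const paused = (24 - activeHrsPerDay) * days * PAUSED_RATE;
  return active + paused;
}
```

An agent that stays paused all month costs about $2.16; the bill grows only with the hours it actually spends working.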
Build with OnCell
Durable execution, per-customer isolation, persistent storage, and streaming — all built in.