← All articles
ai-agents · Signal: 85/100

AI intel digest

Two Roads to Durable Agents: Replay vs. Snapshot — Eric Allam, CEO, Trigger.dev


2026-05-10 · 22 min read · 4,408 words · 12 facts · 1 assumption
Start here

Executive summary

Summary

Eric Allam, CEO of Trigger.dev, argues that replay-based durability (the dominant approach for making agents fault-tolerant) hits fundamental limits as agent sessions grow from hours to days. He proposes splitting the problem in two: context durability (an append-only log of LLM interactions, which fits databases) and execution durability (files, memory, subprocesses, which do not). For execution durability, he advocates OS-level snapshot/restore using Firecracker microVMs, achieving 14 MB compressed snapshots with sub-second saves and hundred-millisecond restores. Trigger.dev has run millions of such snapshots and is open-sourcing the tool as "FC Run."

Sources mentioned

- CGI (1993): first dynamic web backend; stateless process-per-request model.
- PHP/LAMP stack: reused processes but kept the shared-nothing principle.
- Ruby on Rails, Node.js, serverless: descendants of the shared-nothing architecture.
- Workflow/durable execution engines (10-15 years ago): replay model for multi-step side effects.
- CRIU (2011): userspace process checkpoint/restore; Trigger.dev used it in 2024 before switching.
- Firecracker microVMs (AWS): current snapshot/restore foundation for Trigger.dev.
- IBM mainframe (1966): historical precedent for checkpoint/restore in long-running batch jobs.
- FC Run / F Crun: Trigger.dev's upcoming open-source Docker-like CLI for Firecracker containers.

Verdict

This video is worth watching for anyone building or operating AI agent infrastructure. Allam delivers a rare combination: a clear architectural framework (splitting context durability from execution durability), specific production metrics (14 MB, <1 s, ~100 ms), and a concrete open-source tool announcement (FC Run). The signal that is hard to find elsewhere is the empirical claim that replay journals collapse under multi-hour agent sessions, paired with a working alternative that has already handled millions of restores. If you are evaluating durable execution engines for agents, this is one of the few talks that moves beyond "here is how replay works" to "here is why replay fails and what we measured replacing it with."

Count

12 facts; 1 assumption (the agent-duration doubling rate is uncited); 4 demonstrations (snapshot size, timing benchmarks, VM start rate, FC Run CLI behavior). Signal density: 85.

What matters

Signal points

1. Replay-based durability scales poorly for agents because the journal grows unbounded with every turn.

2. The correct split: context log (database) vs. execution snapshot (OS/VM).

3. Firecracker microVM snapshots at Trigger.dev run at ~14 MB compressed, <1 s save, ~100 ms restore.

4. CRIU (2011) works but is limited to single processes and open files; Firecracker captures the entire machine state.

5. Trigger.dev has executed millions of snapshot restores in production.

6. "FC Run" — an open-source Docker-like CLI for Firecracker snapshot/restore — is launching soon.

7. Agent meaningful-work duration is doubling every 4-7 months, making this infrastructure shift time-sensitive.

8. IBM mainframes had checkpoint/restore in 1966; the idea is not new, but the microVM implementation is.

Interpretation

Key ideas

1. Agents are not transactions; they are sessions.

Why: Transactions have a defined start and end; agent sessions persist as long as the user wants, making replay-based durability (designed for transactions) a mismatch.

Implication: Infrastructure designed for workflows must be reconceptualized for long-running, interactive compute sessions.

2. Durability should be split into two separable concerns: context durability and execution durability.

Why: Context (messages, tool calls, responses) is an append-only log that already fits database primitives; execution state (files, memory, subprocesses) lives in the compute layer and cannot be log-replayed.

Implication: Each half can use the right tool for the job — databases for context, OS snapshots for execution — rather than forcing both into a single abstraction.
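
The context half of the split is easy to picture: an append-only table of LLM turns, which any database already handles well. A minimal sketch under that framing (the schema and function names here are illustrative, not Trigger.dev's actual implementation):

```python
import json
import sqlite3

# Context durability as an append-only log: rows are inserted, never updated.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE context_log ("
    " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
    " role TEXT NOT NULL,"
    " content TEXT NOT NULL)"
)

def append_turn(role: str, content) -> None:
    """Record one LLM interaction (message, tool call, or response)."""
    db.execute(
        "INSERT INTO context_log (role, content) VALUES (?, ?)",
        (role, json.dumps(content)),
    )
    db.commit()

def load_context() -> list:
    """Rebuild the conversation with an ordered scan — no replay machinery."""
    rows = db.execute("SELECT role, content FROM context_log ORDER BY seq")
    return [{"role": r, "content": json.loads(c)} for r, c in rows]
```

Execution state — scratch files the agent wrote, a headless browser it launched — has no such natural row shape, which is why the talk routes it down the snapshot path instead.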

3. Stateless compute has dominated for 30 years, but agents force a shift to stateful compute.

Why: Shared-nothing architectures worked because requests were short and self-contained; agents need to preserve complex runtime state across arbitrary idle periods.

Implication: A new layer of infrastructure (snapshot/restore at the VM level) becomes necessary, potentially disrupting serverless and container paradigms.

4. Snapshot/restore at the microVM level is the right granularity for agent execution durability.

Why: Process-level checkpointing (CRIU) fails for multi-process workloads and file-system state; full-machine snapshots capture everything transparently.

Implication: Firecracker microVMs plus seekable compression make this economically viable where naive full-RAM snapshots would not be.

5. Seekable compression + layered snapshots solve the cost problem of VM snapshots.

Why: A 512 MB RAM snapshot is too large for frequent save/restore; decompressing only needed pages on restore reduces transfer and storage.

Implication: Sub-second snapshots and ~100 ms restores enable practical on-demand agent hibernation, making "pause while user is at lunch" economically feasible.
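
The mechanism behind "decompress only the needed pages" can be illustrated with chunked compression: compress memory in independent chunks and keep an index of compressed sizes, so reading one page touches one chunk rather than the whole image. A toy sketch with zlib (real seekable formats such as seekable zstd are more involved, and the chunk size here is arbitrary):

```python
import zlib

PAGE = 4096  # illustrative guest page size

def snapshot(mem: bytes, chunk_pages: int = 16):
    """Compress memory in independent chunks; return blob + per-chunk sizes."""
    chunk_size = chunk_pages * PAGE
    chunks = [zlib.compress(mem[i:i + chunk_size])
              for i in range(0, len(mem), chunk_size)]
    return b"".join(chunks), [len(c) for c in chunks]

def read_page(blob: bytes, sizes, page_no: int, chunk_pages: int = 16) -> bytes:
    """Decompress only the chunk holding the requested page."""
    chunk_no = page_no // chunk_pages
    start = sum(sizes[:chunk_no])                      # seek via the index
    chunk = zlib.decompress(blob[start:start + sizes[chunk_no]])
    return chunk[(page_no % chunk_pages) * PAGE:][:PAGE]
```

At roughly 14 MB compressed versus a 512 MB naive RAM image, the transfer and storage savings are what make frequent save/restore economical.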

Evidence

Key facts

The first dynamic web backend was CGI in 1993, which forked a new process per request and was completely stateless.

Confidence: HIGH

Evidence: the very first dynamic web backend was CGI back in 1993... the server forks a whole new process... and then the process goes away

The "shared nothing" architecture (request + DB = response, stateless compute layer) has dominated backend infrastructure for ~30 years, including Rails, Node.js, and serverless.

Confidence: HIGH

Evidence: this became the dominant backend infrastructure for the last 30 years... Ruby on Rails, Node.js, serverless, it all follows the same paradigm

Workflow/durable execution engines emerged 10-15 years ago to solve multi-step side-effect failures, using a replay model where every side effect is wrapped in a cached step.

Confidence: HIGH

Evidence: about 10 to 15 years ago, workflow and durable execution engines were sort of adopted to solve this problem... wrap every single side effect in like a step that becomes cached
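
The replay model described here can be sketched in a few lines: each side effect is wrapped in a named step whose result is journaled, so a re-run after a crash returns cached results instead of re-executing the effect. A minimal sketch — the `Journal`/`step` names are illustrative, not any real engine's API:

```python
import functools

class Journal:
    """Append-only record of completed step results.

    A real engine persists this; replay after a crash reloads it
    and skips every step already recorded.
    """
    def __init__(self):
        self.results = {}

def step(journal, name):
    """Wrap a side effect so replay returns the cached result."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if name in journal.results:        # replay path: no re-execution
                return journal.results[name]
            result = fn(*args, **kwargs)       # first run: execute and record
            journal.results[name] = result
            return result
        return wrapper
    return decorator
```

Note the cost the talk later attacks: the journal only ever grows, so an agent session with thousands of turns drags an ever-larger journal through every replay.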

Agent meaningful-work duration is doubling every 4-7 months, currently at "a few hours" and heading toward "multiple days."

Confidence: MEDIUM (speaker cites "apparently" without naming the source)

Evidence: this measure of like how long agents uh can actually do meaningful work, and apparently it's doubling every 4 to 7 months. So, right now we're on about like a few hours

IBM mainframes in 1966 had checkpoint/restore capability for long-running expensive jobs.

Confidence: HIGH

Evidence: this is an IBM mainframe from 1966, and it actually has checkpoint and and restore... they would run these super expensive jobs for hours

CRIU (Checkpoint/Restore In Userspace) was developed in 2011 and works by injecting a "parasite" into a process to dump state.

Confidence: HIGH

Evidence: Fast forward to 2011, a thing called CRIU was um, developed... inject a process with this like a parasite, basically

Trigger.dev shipped CRIU-based snapshot/restore in 2024 and has done "millions of snapshot restores since."

Confidence: HIGH

Evidence: in 2024, we actually shipped this, um, and we've done millions of snapshot restores since


CRIU downsides: only checkpoints a single process (not multi-process workloads like Chrome or FFmpeg), only captures open files, and container registry push/pull makes it slow.

Confidence: HIGH

Evidence: you sort of can only checkpoint like a process... It only captures open files... once you are compatible with containers, you have to work with registries and push and pull, and then it gets very slow

Trigger.dev moved to Firecracker microVMs for snapshotting entire machines.

Confidence: HIGH

Evidence: last year we moved to, um, Firecracker micro VMs. And this allows us to sort of snapshot like the entire machine

Trigger.dev achieves 14 MB compressed snapshots using seekable compression and layering, with sub-second snapshot time and hundred-millisecond restore time.

Confidence: HIGH

Evidence: we can get the the, um, snapshot down to like 14 megabytes compressed... snapshots are like slightly under a second, and restores are a couple hundred milliseconds

Trigger.dev is open-sourcing "FC Run" (or "F Crun"), a Docker-like CLI for running containers in Firecracker VMs with snapshot/restore/fork.

Confidence: HIGH

Evidence: we've actually bundled all of this into, uh, tool that's going to be open source here soon. Uh, it's called FC Run, or F Crun

FC Run benchmark: 15,000 VM starts per minute, ~30 FPS if rendered as video.

Confidence: HIGH

Evidence: we're doing like 15,000 VM starts per minute... The the FPS would be about 30 FPS

Memorable lines

Quotes

- "replay gave us these like sort of durable transactions, but you know, an agent isn't like a transaction, it's like a session"
- "for 30 years we sort of had this, uh, stateless compute as the sort of core of back-end infrastructure. And I think agents are sort of forcing this, uh, move to become stateful compute"
- "we can get the the, um, snapshot down to like 14 megabytes compressed"
- "snapshots are like slightly under a second, and restores are a couple hundred milliseconds"
- "we're doing like 15,000 VM starts per minute... The the FPS would be about 30 FPS"
- "this is an IBM mainframe from 1966, and it actually has checkpoint and and restore"