Understanding AWS Lambda Durable Functions: A Technical Overview

2025-12-07 · 4 min read

AWS Lambda's serverless architecture scales automatically and offers considerable flexibility, but its stateless nature makes it hard to build workflows that must maintain state over time. The "Durable Function" pattern is one approach to this problem: it lets developers create stateful, long-running orchestrations using Lambda functions. This article gives a technical overview of the core components and mechanics behind AWS Lambda Durable Functions.

The APIs Under the Hood

A durable function's state and lifecycle are managed by a set of backend APIs. The Durable Execution SDK calls these APIs to orchestrate the workflow.

CheckpointDurableExecution

This API is used to save the progress of a durable function execution. The SDK calls this to checkpoint completed steps and schedule asynchronous operations, such as timers or external function calls. Each checkpoint operation consumes the current checkpoint token and returns a new one, ensuring that state updates are applied in the correct sequence and preventing duplicates.
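The token handoff described above can be illustrated with a small in-memory model. This is only a sketch of the idea, not the service's implementation: the class `InMemoryCheckpointStore`, the `StaleTokenError` exception, the token format, and the operation fields and status strings are all hypothetical names chosen for illustration.

```python
import uuid


class StaleTokenError(Exception):
    """Raised when a checkpoint presents a token that was already consumed."""


class InMemoryCheckpointStore:
    """Hypothetical stand-in for the durable execution backend."""

    def __init__(self):
        self.operations = []                   # checkpointed operations, in order
        self.current_token = uuid.uuid4().hex  # token the next checkpoint must present

    def checkpoint(self, token, updates):
        # Only the most recently issued token is accepted.
        if token != self.current_token:
            raise StaleTokenError("checkpoint token already consumed")
        self.operations.extend(updates)
        # Consume the old token and issue a fresh one for the next call.
        self.current_token = uuid.uuid4().hex
        return self.current_token


store = InMemoryCheckpointStore()
t0 = store.current_token
t1 = store.checkpoint(t0, [{"id": "step-1", "status": "SUCCEEDED"}])
t2 = store.checkpoint(t1, [{"id": "timer-1", "status": "STARTED"}])

# Re-sending an earlier request with its old token is rejected,
# so the same update cannot be applied twice.
try:
    store.checkpoint(t0, [{"id": "step-1", "status": "SUCCEEDED"}])
    duplicate_rejected = False
except StaleTokenError:
    duplicate_rejected = True
```

Because each call must present the token returned by the previous one, updates form a single chain: an out-of-order or repeated request carries a stale token and fails instead of silently writing a duplicate.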

GetDurableExecution

This API retrieves detailed information about a specific durable execution. It returns the execution's current status, its input payload, the final result or error, and metadata such as start time and usage statistics. The only required input is the durable execution ARN.

GetDurableExecutionHistory

This API retrieves the complete execution history for a durable execution, showing all steps, callbacks, and events that occurred. The history serves as a detailed audit trail of the execution's progress, which is useful for debugging and analysis.

GetDurableExecutionState

This API is central to the replay process. The SDK calls it to retrieve the current execution state required to resume a function. It takes the durable execution ARN and a checkpoint token and returns the sequence of operations. Completed parent operations do not include their children's details, as those children do not need to be replayed.
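To make the pruning rule concrete, here is a minimal sketch of how children of completed parents could be filtered out of a flat operation list. The record shape (`id`, `parent_id`, `status`) and the status strings are assumptions made for this example, and treating only `SUCCEEDED` as "completed" is a simplification; the actual response format may differ.

```python
def prune_for_replay(operations):
    """Drop every operation that has a completed ancestor.

    `operations` is a list of dicts with hypothetical keys:
    id, parent_id (None for top-level operations), and status.
    """
    by_id = {op["id"]: op for op in operations}

    def has_completed_ancestor(op):
        parent_id = op["parent_id"]
        while parent_id is not None:
            parent = by_id[parent_id]
            if parent["status"] == "SUCCEEDED":
                return True
            parent_id = parent["parent_id"]
        return False

    return [op for op in operations if not has_completed_ancestor(op)]


history = [
    {"id": "A", "parent_id": None, "status": "SUCCEEDED"},
    {"id": "A.1", "parent_id": "A", "status": "SUCCEEDED"},
    {"id": "A.2", "parent_id": "A", "status": "SUCCEEDED"},
    {"id": "B", "parent_id": None, "status": "STARTED"},
    {"id": "B.1", "parent_id": "B", "status": "SUCCEEDED"},
]
replay_state = prune_for_replay(history)
print([op["id"] for op in replay_state])  # ['A', 'B', 'B.1']
```

Parent `A` has finished, so its recorded result is enough and its children are omitted. Parent `B` is still in progress, so its child `B.1` is kept because the replay has to re-enter `B` and skip past `B.1`.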

ListDurableExecutionsByFunction

This API returns a list of all durable executions associated with a specific Lambda function, which supports monitoring and administrative tasks.

SendDurableExecutionCallbackSuccess

This API sends a successful response for a callback operation within a durable execution. An external system calls it after completing a task that the durable function was waiting for. The callbackId is used to route the response to the correct execution.

SendDurableExecutionCallbackFailure

Similar to the success API, this sends a failure response for a callback operation. This is used when an external system cannot complete its task successfully, allowing the durable function to handle the error.
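The routing role of the callbackId in both callback APIs can be sketched with an in-memory registry. Everything here is hypothetical: the `CallbackRegistry` class, its method names, the execution identifiers, and the rule that a callback can be answered only once are assumptions made for illustration, not documented service behavior.

```python
class CallbackRegistry:
    """Hypothetical in-memory model of callback routing."""

    def __init__(self):
        self._pending = {}  # callback_id -> execution waiting on it
        self.outcomes = {}  # execution -> ("success" | "failure", payload)

    def register(self, callback_id, execution):
        self._pending[callback_id] = execution

    def _resolve(self, callback_id, kind, payload):
        # pop() means a callback can be answered only once in this sketch.
        execution = self._pending.pop(callback_id, None)
        if execution is None:
            raise KeyError(f"unknown or already-resolved callback: {callback_id}")
        self.outcomes[execution] = (kind, payload)
        return execution

    def send_success(self, callback_id, result):
        return self._resolve(callback_id, "success", result)

    def send_failure(self, callback_id, error):
        return self._resolve(callback_id, "failure", error)


registry = CallbackRegistry()
registry.register("cb-1", "execution-A")
registry.register("cb-2", "execution-B")

# External systems report back using only the callback id.
registry.send_success("cb-1", {"approved": True})
registry.send_failure("cb-2", "payment declined")
```

The external system never needs to know which execution it is answering; the callback id alone identifies the waiting operation, and the outcome (success or failure) is delivered to that execution for the workflow code to handle.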

StopDurableExecution

This API is used to force-stop a durable execution. The execution transitions to a STOPPED status and cannot be resumed. Any operations that were in progress are terminated.

How Lambda Durable Functions Work

sequenceDiagram
    participant Client
    participant Lambda Runtime
    participant Durable SDK
    participant Durable Service

    Client->>Lambda Runtime: 1. Trigger function
    Lambda Runtime->>Durable SDK: 2. Start Execution
    Durable SDK->>Durable Service: 3. GetDurableExecutionState
    Durable Service-->>Durable SDK: Return Execution History

    Durable SDK->>Durable SDK: 4. Execute user code & replay history
    Note right of Durable SDK: Skips completed steps

    Durable SDK->>Durable SDK: Reaches a new async operation
    Durable SDK->>Durable Service: 5. CheckpointDurableExecution
    Durable Service-->>Durable SDK: Acknowledge Checkpoint
    Note over Lambda Runtime, Durable Service: Lambda execution is paused.

    %% ... Some time later, the async operation completes ... %%

    Durable Service->>Lambda Runtime: 7. Resume Execution (e.g., timer fires)
    Lambda Runtime->>Durable SDK: Start Execution
    Durable SDK->>Durable Service: GetDurableExecutionState
    Durable Service-->>Durable SDK: Return Updated History
    Durable SDK->>Durable SDK: Execute user code & replay to continue
    Note right of Durable SDK: Resumes from where it left off.

The core of the durable function pattern is a "checkpoint and replay" mechanism, managed by the SDK. Here is the typical execution flow:

  1. Initiation: The AWS Lambda runtime initiates the function execution.
  2. State Retrieval: The Durable Execution SDK immediately calls GetDurableExecutionState to retrieve the current execution history from the Lambda service.
  3. Checkpoint Manager: A checkpoint manager is initialized within the SDK, loaded with the retrieved history.
  4. User Code Execution: The user's function code is executed from the beginning. The SDK intercepts calls to durable operations. If an operation is already in the history, the SDK returns the recorded result without re-executing it.
  5. Persisting Results: When the code performs a new asynchronous operation, the result is persisted by the checkpoint manager via a CheckpointDurableExecution call.
  6. Continuation and Retry: The Lambda execution continues. If a failure occurs, the function can be retried with the same durable execution ARN. On retry, the SDK will load the last checkpoint and iterate through the history to determine which steps are already completed, allowing it to resume from where it left off.
  7. Handling Pauses: For operations that require waiting (e.g., a timer, a callback, or another function invocation), the process is slightly different. When the SDK creates a checkpoint for such an operation, the Lambda service may stop the execution of the current function. It will automatically resume the function only when the awaited event occurs (e.g., the timer completes or the callback is received).
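The steps above can be sketched as a small, self-contained simulation. It is not the Durable Execution SDK: `DurableContext`, `step`, `wait_for`, `Suspend`, and `invoke` are hypothetical names, the history is a plain dictionary standing in for the state held by the service, and operations are identified by name purely to keep the example short.

```python
class Suspend(Exception):
    """Raised to end the current invocation while waiting on an event."""


class DurableContext:
    """Hypothetical, in-memory model of the SDK's checkpoint manager."""

    def __init__(self, history):
        self.history = history  # operation name -> recorded result
        self.executed = []      # steps actually run during this invocation

    def step(self, name, fn):
        if name in self.history:     # replay: return the recorded result
            return self.history[name]
        result = fn()                # first run: execute the step...
        self.executed.append(name)
        self.history[name] = result  # ...and checkpoint its result
        return result

    def wait_for(self, name, completed_events):
        if name in self.history:
            return
        if name in completed_events:  # the awaited event has occurred
            self.history[name] = True
            return
        raise Suspend(name)           # pause until the event arrives


def workflow(ctx, completed_events):
    order = ctx.step("reserve", lambda: {"order_id": 42})
    ctx.wait_for("approval", completed_events)
    return ctx.step("charge", lambda: f"charged order {order['order_id']}")


def invoke(history, completed_events):
    # Each invocation starts the user code from the beginning.
    ctx = DurableContext(history)
    try:
        return "SUCCEEDED", workflow(ctx, completed_events), ctx.executed
    except Suspend as pending:
        return "PENDING", str(pending), ctx.executed


history = {}  # survives across invocations, like service-side state
first = invoke(history, completed_events=set())
second = invoke(history, completed_events={"approval"})
print(first)   # ('PENDING', 'approval', ['reserve'])
print(second)  # ('SUCCEEDED', 'charged order 42', ['charge'])
```

The second invocation runs `workflow` from its first line again, yet `reserve` is not re-executed: its result comes from the history, and only `charge` actually runs. That is the replay behavior described in steps 4 through 7.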

This cycle of executing, checkpointing, and replaying from history is what gives the function its "durable" quality, allowing a stateless Lambda function to orchestrate a stateful, long-running workflow.