Durable Python: Running Reliable Workflows on Unreliable Infrastructure

Haim Zlatokrilov

Language: Hebrew

This talk was presented on 2025-09-09 at the PyCon Israel 2025 conference.

Servers crash, containers restart, and services fail. This talk introduces Durable Python: a way to make workflows survive infrastructure failures, which is crucial in distributed systems. It works by combining AST transformation with durable execution platforms.

Modern Python workflows are more distributed than ever: orchestrating APIs, cloud services, microservices, and databases across environments where infrastructure is not always reliable. When a server crashes, a container restarts, or a service call fails mid-process, traditional Python scripts often must restart from scratch, risking duplicated work, data inconsistencies, or lost progress.

Durable Python changes this. It introduces a model where workflow state is preserved, and execution can automatically resume from the point of failure, without manual recovery, complex retry logic, or redundant operations. This talk will cover:

Why infrastructure failures are inevitable, and why Python needs built-in durability to handle them.

The core principles of durable execution: state persistence, fault recovery, and reliable orchestration.

Practical examples and patterns for introducing durability into real-world Python automations, including CI pipelines, DevOps processes, microservice orchestration, and long-running AI agents.

How Durable Python works under the hood — leveraging durable execution platforms, transforming Python's AST, converting non-deterministic calls into trackable activities (like automatic checkpoints), and orchestrating everything seamlessly.
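To make the AST step above more concrete, here is a minimal sketch of how calls could be rewritten into trackable activities using Python's standard ast module. It is an illustration only, not Durable Python's actual implementation: the names ActivityWrapper, make_durable, and _activity are hypothetical, and the choice to wrap only an explicit list of function names is an assumption made to keep the example short.

```python
import ast


class ActivityWrapper(ast.NodeTransformer):
    """Rewrite calls `f(x)` to `_activity(f, x)` for the listed function names.

    Hypothetical helper for illustration; not part of any real library.
    """

    def __init__(self, targets):
        self.targets = set(targets)

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        if isinstance(node.func, ast.Name) and node.func.id in self.targets:
            wrapped = ast.Call(
                func=ast.Name(id="_activity", ctx=ast.Load()),
                args=[node.func, *node.args],
                keywords=node.keywords,
            )
            return ast.copy_location(wrapped, node)
        return node


def make_durable(source, targets):
    """Parse the source, wrap the target calls, and compile the result."""
    tree = ActivityWrapper(targets).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return compile(tree, "<workflow>", "exec")


calls = []  # names of functions that went through the wrapper


def _activity(fn, *args, **kwargs):
    # A real platform would persist the result here; this sketch only records the call.
    calls.append(fn.__name__)
    return fn(*args, **kwargs)


SOURCE = """
def fetch(x):
    return x * 2

result = fetch(21) + len("ab")
"""

namespace = {"_activity": _activity}
exec(make_durable(SOURCE, {"fetch"}), namespace)
print(namespace["result"], calls)
```

Only the call to fetch is routed through _activity; len is left untouched because it is not in the target set. A production system would need a policy for deciding which calls count as non-deterministic, which this sketch does not attempt.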

We’ll also live-demo a long-running Python workflow, simulate real infrastructure failures, and show the process resuming exactly where it left off — without re-executing completed steps or losing state.
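The resume behavior described above can be approximated in a few lines of plain Python. The sketch below is not the demo from the talk and does not use any durable execution platform: the Journal class, its step method, and the build/deploy functions are all hypothetical, and a JSON file on local disk stands in for the platform's state store.

```python
import json
import os
import tempfile


class Journal:
    """Record of completed step results, stored as JSON on disk (hypothetical)."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def step(self, name, fn, *args):
        # A step that already completed returns its recorded result instead of re-running.
        if name in self.results:
            return self.results[name]
        value = fn(*args)
        self.results[name] = value
        # Write to a temporary file, then rename, so a crash cannot leave a half-written journal.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.results, f)
        os.replace(tmp, self.path)
        return value


executed = []  # tracks which step bodies actually ran


def build():
    executed.append("build")
    return "artifact-1"


def deploy(artifact, fail):
    if fail:
        raise RuntimeError("simulated crash")
    executed.append("deploy")
    return f"deployed {artifact}"


def workflow(journal, fail):
    artifact = journal.step("build", build)
    return journal.step("deploy", deploy, artifact, fail)


path = os.path.join(tempfile.mkdtemp(), "journal.json")

# First run crashes during deploy, after build has been recorded.
try:
    workflow(Journal(path), fail=True)
except RuntimeError:
    pass

# Second run starts from a fresh Journal object, as a restarted process would.
final = workflow(Journal(path), fail=False)
print(final, executed)
```

On the second run the build step is served from the journal, so its body runs only once across both attempts. This works only if step results are JSON-serializable and step names are stable between runs, two constraints that real durable execution platforms also have to address in some form.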