Cost-Efficient AI Systems in Practice

Michael Levinger

Language: Hebrew

The presentation was given on 2026.06.14 at PyData Tel Aviv @ Melio.

Deploying large language models and AI agents in real-world systems means constantly trading off cost, latency, and performance. This talk explores how to optimize LLM- and agent-based systems with techniques such as caching, model routing and cascades, tuning, RAG, and distillation, which can significantly reduce costs without sacrificing quality. Through a case study of an ATO-agent system, we’ll also cover practical approaches to cost estimation, monitoring, and budgeting. Finally, we’ll compare leading industry models such as Gemini, Claude, and GPT on response speed, cost efficiency, and real-world performance, and discuss how to choose the right model for each use case within broader agentic workflows.
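As a rough illustration of how caching, a model cascade, and per-call cost estimation can fit together, here is a minimal Python sketch. It is not taken from the talk: the `ModelTier` and `Cascade` names, the confidence threshold, the stub models, and the per-token prices are all hypothetical placeholders, not real vendor pricing or APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A model call returns (answer, confidence, input_tokens, output_tokens).
# This signature is an assumption made for the sketch, not a real client API.
ModelCall = Callable[[str], Tuple[str, float, int, int]]


@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_1m_input: float   # hypothetical price per 1M input tokens
    usd_per_1m_output: float  # hypothetical price per 1M output tokens
    call: ModelCall


def estimate_cost(tier: ModelTier, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from token counts and tier prices."""
    return (
        input_tokens * tier.usd_per_1m_input
        + output_tokens * tier.usd_per_1m_output
    ) / 1_000_000


class Cascade:
    """Try cheaper tiers first; escalate only when confidence is too low."""

    def __init__(self, tiers: List[ModelTier], min_confidence: float) -> None:
        if not tiers:
            raise ValueError("at least one model tier is required")
        self.tiers = tiers
        self.min_confidence = min_confidence
        self.cache: Dict[str, str] = {}  # exact-match prompt cache
        self.spent_usd = 0.0             # running total for budgeting

    def ask(self, prompt: str) -> str:
        # A cache hit costs nothing and skips every model call.
        if prompt in self.cache:
            return self.cache[prompt]
        answer = ""
        for index, tier in enumerate(self.tiers):
            answer, confidence, tokens_in, tokens_out = tier.call(prompt)
            self.spent_usd += estimate_cost(tier, tokens_in, tokens_out)
            is_last_tier = index == len(self.tiers) - 1
            if confidence >= self.min_confidence or is_last_tier:
                break
        self.cache[prompt] = answer
        return answer


# Stub models standing in for real LLM clients (purely illustrative).
def small_model(prompt: str) -> Tuple[str, float, int, int]:
    confidence = 0.9 if len(prompt) < 20 else 0.4
    return "small:" + prompt, confidence, 100, 50


def large_model(prompt: str) -> Tuple[str, float, int, int]:
    return "large:" + prompt, 0.95, 100, 50


cascade = Cascade(
    tiers=[
        ModelTier("small", 1.0, 2.0, small_model),
        ModelTier("large", 10.0, 20.0, large_model),
    ],
    min_confidence=0.8,
)
```

With these stubs, a short prompt is answered by the small tier alone, a repeated prompt is served from the cache at no extra cost, and a long prompt pays for both tiers because the small model's confidence falls below the threshold. In a real system the confidence signal, the cache key, and the prices would each need careful design and measurement.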