Accelerating ML Development with Multi-Modal Datasets: Leveraging Python, Parquets, and Daft

Guy Pozner
Language: Hebrew
video in Hebrew
The presentation was given on 2024.09.16 at PyCon Israel 2024 - Conference.

Mobileye accelerates ML development with multi-modal datasets using Python, Parquets, and Daft. We will cover dataset formats, Daft’s capabilities, its usage examples, and its integration into Mobileye's cloud-native architecture.

Large-scale multi-modal datasets play a pivotal role in the development of computer vision solutions based on deep learning algorithms. Historically, different data formats have been used at various stages of the ML development life cycle to optimize specific tasks. During training, a sequential pass over the entire dataset is essential, while validation involves map-reduce operations.
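The two access patterns described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Mobileye's pipeline: the `shards` data, the `(prediction, label)` sample layout, and the helper names `iter_samples`, `shard_stats`, and `merge_stats` are all hypothetical.

```python
from functools import reduce

# Hypothetical toy dataset: each shard is a list of (prediction, label) pairs.
shards = [
    [(1, 1), (0, 1), (1, 1)],
    [(0, 0), (1, 0)],
    [(1, 1)],
]


def iter_samples(shards):
    """Training-style access: one sequential pass over every sample."""
    for shard in shards:
        for sample in shard:
            yield sample


def shard_stats(shard):
    """Map step: per-shard (correct, total) counts, computed independently."""
    correct = sum(1 for pred, label in shard if pred == label)
    return correct, len(shard)


def merge_stats(a, b):
    """Reduce step: combine two partial (correct, total) counts."""
    return a[0] + b[0], a[1] + b[1]


# Training touches every sample once, in order.
samples_seen = sum(1 for _ in iter_samples(shards))

# Validation maps over shards, then reduces the partial results.
correct, total = reduce(merge_stats, map(shard_stats, shards), (0, 0))
accuracy = correct / total
```

The sequential pass favors formats that stream well, while the map-reduce pattern favors formats that can be split and processed shard by shard, which is one reason a single format rarely suited both stages.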

In this talk, we’ll delve into how Mobileye leverages Python, Parquets, and Daft to streamline the AI development life cycle. We’ll explore the following key aspects:

- Dataset Formats and Reading Options - We’ll discuss various options for representing multi-modal datasets, and how choosing the right format impacts training and validation efficiency.
- What Is Daft - Daft is a high-performance Python query engine designed for handling complex, multimodal data types. Its Rust core engine executes operations lazily via Daft’s Expressions API. Daft also integrates with essential Python libraries like PyTorch and offers efficient cloud storage integration.
- Examples of Daft Usage - Through practical examples, we’ll demonstrate how Daft simplifies working with multi-modal datasets, from loading data to preprocessing.
- Daft in a Cloud-Native Architecture - Learn how Daft integrates into Mobileye’s cloud infrastructure. We’ll explore its role in enabling multi-tenant access to datasets, ensuring fast data retrieval, and maintaining a single source of truth.
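The idea of lazy execution through expressions, mentioned above, can be illustrated with a small standard-library sketch. This is a toy model and deliberately does not import Daft or reproduce its implementation: the `Expr` and `LazyFrame` classes, the `col` and `lit` helpers, and the sample `frames` records are hypothetical names chosen for illustration. The point is only that building a query records a plan, and no row is processed until `collect()` is called.

```python
import operator


class Expr:
    """A deferred per-row computation; constructing one does no work."""

    def __init__(self, fn):
        self._fn = fn

    def evaluate(self, row):
        return self._fn(row)

    def _binary(self, other, op):
        # Wrap plain Python values so both operands are expressions.
        rhs = other if isinstance(other, Expr) else lit(other)
        return Expr(lambda row: op(self.evaluate(row), rhs.evaluate(row)))

    def __mul__(self, other):
        return self._binary(other, operator.mul)

    def __gt__(self, other):
        return self._binary(other, operator.gt)


def col(name):
    """Expression that reads a named field from a row."""
    return Expr(lambda row: row[name])


def lit(value):
    """Expression that always yields a constant."""
    return Expr(lambda row: value)


class LazyFrame:
    """Holds source rows plus a recorded plan; runs only on collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def with_column(self, name, expr):
        return LazyFrame(self._rows, self._steps + (("with_column", name, expr),))

    def where(self, expr):
        return LazyFrame(self._rows, self._steps + (("where", None, expr),))

    def collect(self):
        out = []
        for source_row in self._rows:
            row = dict(source_row)  # copy so the source rows stay untouched
            keep = True
            for kind, name, expr in self._steps:
                if kind == "with_column":
                    row[name] = expr.evaluate(row)
                elif not expr.evaluate(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out


# Hypothetical image metadata records.
frames = [
    {"frame_id": 1, "width": 1280, "height": 720},
    {"frame_id": 2, "width": 640, "height": 480},
    {"frame_id": 3, "width": 1920, "height": 1080},
]

# Building the plan performs no computation yet.
plan = (
    LazyFrame(frames)
    .with_column("pixels", col("width") * col("height"))
    .where(col("pixels") > 500_000)
)

# Execution happens here.
result = plan.collect()
```

Deferring execution this way gives an engine the whole plan before any data is read, which is what allows a real query engine to reorder or combine steps; the toy above simply runs them in the order recorded.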

Throughout the talk, we’ll discuss real-world scenarios showing how Daft accelerates the AI development life cycle. Attendees will gain insights into best practices for leveraging Daft effectively.