PyCon Israel 2022 - Conference
Schedule
In this panel, we will ask industry leaders about their Python strategy and usage patterns.
- Michael Czeizler (Mobileye)
- Barak Peleg (AI21)
- Orit Wasserman (Red Hat)
- Lior Mizrahi (BlueVine)
- Oren Nissenbaum (Via)
- Miki Tebeka (353 Solutions)
- Tal Sayhonov (CYE)
Have you ever written a simple function, and added it to your pipeline only to discover it is WAY slower than it should be? In this talk, I will demonstrate how to sniff out functions that slow down your pipeline and how to be proactive about speeding them up.
Have you ever written a short and simple function, and added it to your pipeline only to discover it is WAY slower than it should be? Did you know some pandas functions are written in cython and work much faster than others? Python is not known to be the fastest language, but you can be proactive about speeding things up!
While Python is featureful and simple to write, it isn’t known as a fast language. Many of Python’s functions and one-liners have hidden complexity costs. Choosing the wrong ones can slow down your code and those costs definitely add up. For example - popping an item from the end of a list vs from the start of it, or using the “in” operator on a set vs a list. A few seconds running time difference in your local script could mean a few hours time difference on the production pipeline.
If you want to learn what to look out for, how to overcome these pitfalls and how to make your code more efficient, this is the talk for you.
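For instance, here is a minimal sketch (illustrative timings only, not material from the talk) of the kind of hidden costs described above: popping from the front of a list is O(n) while popping from the end is O(1), and membership tests on a set are far cheaper than on a list.

import timeit

# Popping from the start of a list shifts every remaining element (O(n));
# popping from the end is O(1).
print(timeit.timeit("while xs: xs.pop(0)", setup="xs = list(range(10_000))", number=1))
print(timeit.timeit("while xs: xs.pop()", setup="xs = list(range(10_000))", number=1))

# Membership tests: O(n) on a list vs O(1) on average for a set.
print(timeit.timeit("9_999 in xs", setup="xs = list(range(10_000))", number=10_000))
print(timeit.timeit("9_999 in xs", setup="xs = set(range(10_000))", number=10_000))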
asyncio, Python's concurrent I/O library, can power very-high-performance applications. Come and hear the story of how we were able to replace a legacy service cluster with a single asyncio-powered instance, and how you can do it too.
Modern services must handle vast amounts of traffic efficiently and in a scalable manner. One method of achieving high throughput while keeping things simple is by utilizing concurrent I/O.
The asyncio package is Python's offering for building high-performance, concurrent applications using non-blocking I/O. It is also known as an event loop or async/await (among other names), but in essence, it's a useful method for achieving high concurrency efficiently – one that differs from the principles of multithreading and offers unique benefits.
In this talk, I will share the story of why we designed an asyncio-based Python service, how its performance exceeded that of the Java service it replaced by an order-of-magnitude, and what learnings we gained from it. These learnings can help us design super-fast, highly concurrent services.
We will talk about:
- The principle behind asyncio's efficiency - its secret sauce.
- When asyncio shines, and when you might opt for a different approach.
- How to combine it with other paradigms to maximize your application's performance.
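As a taste of the model (a minimal sketch, not the service described in the talk), a handful of coroutines can wait on I/O concurrently inside a single thread:

import asyncio

async def fetch(name, delay):
    # asyncio.sleep stands in for a non-blocking network call
    await asyncio.sleep(delay)
    return f"{name}: done"

async def main():
    # All three "requests" run concurrently on one event loop
    results = await asyncio.gather(fetch("a", 1), fetch("b", 1), fetch("c", 1))
    print(results)  # finishes in about 1 second, not 3

asyncio.run(main())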
We all "import" modules . But how does Python find and load modules, and making their definitions available? The answer is surprisingly complex. This talk walks you through the world of module importation, from load path, to finders and loaders.
Modules are a key feature of Python, allowing us to easily reuse our own code and take advantage of publicly available modules from PyPI. It's a rare program that doesn't include at least one "import" statement. But what actually happens when we import a module? How does Python find our file? How does it decide whether it should even try to find our module? And after it finds our module file, how does Python load it into memory, assigning to its attributes?
In this talk, I'll walk you through what happens when you "import" a module into Python. The mechanism is surprisingly complex, in no small part because it has to take so many possibilities into consideration. We'll talk about finders and loaders, and about the many ways in which you can customize the module-loading mechanism if you find a need to do so.
If you've ever imported a module, then this talk will pull back the curtain a bit, helping you to understand what's happening under the hood.
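To get a feel for the machinery before the talk (a small sketch using the standard importlib APIs, not the talk's own material), you can inspect the search path and the finders, and even watch imports go by with a custom meta-path finder:

import sys
import importlib.abc

print(sys.path[:3])    # where Python looks for module files
print(sys.meta_path)   # the finders consulted for every import

class NoisyFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, fullname, path, target=None):
        print(f"looking for {fullname!r}")
        return None  # defer to the remaining finders

sys.meta_path.insert(0, NoisyFinder())
import json  # triggers the finder (unless json is already cached in sys.modules)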
Does your service architecture slow you down instead of enabling rapid and frequent deliveries? If you have found yourself in this situation, you will benefit from hearing about our journey towards an efficient repository structure.
In this session, we will share our journey towards a monorepo. We will discuss the many reasons we made this decision, and how it improved our development. Learning from our experience will help you choose the correct repository structure for your project and needs. We will get into details about the python specifics of a monorepo: what is a good python repository structure, how to utilize python relative imports, editable installation, and how to handle virtual environments. Getting practical, we will present our implementation of monorepo in Gitlab and how we transferred the repositories. We will dive into the technical details sharing commands and snippets.
At the end of this session, you will be able to choose the correct repository structure for your project and will have gained practical tools to implement a monorepo and conduct a safe transfer.
Applications today are giant meshes of services and interconnected APIs. However, there isn’t a standardized, systematic way to integrate them. In this talk, we'll cover the patterns of working with 3rd party integrations.
Applications today are giant meshes of services and interconnected APIs. We benefit from an ecosystem of rich SaaS and 3rd party integrations that are the de facto standard for many platforms and applications: Slack & Jira for example. However, there isn’t a standardized, systematic way to integrate them.
In this talk, we'll cover the patterns and antipatterns of working with 3rd party integrations, and suggest how to wrap them into a very Pythonic framework that focuses on ease of use and great developer experience - using widely used frameworks such as Typer, Pydantic and methodologies like Dependency Injection, Snapshot testing & JSONSchema.
At the end of the talk, the audience will understand the problem of many 3rd party integrations, how existing products solved that problem and how to develop an infrastructure that lets you develop these integrations faster.
Talk structure:
- The problems with many integrations
- Use case: how Airbyte dealt with many integrations and created a community out of it
- How to create a framework that lets you integrate new SaaS easily, using Pydantic, Typer, FastAPI, and decorators & dependency injection
- How to improve testing for integrations: snapshot testing using module patching
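As a taste of the approach (a minimal sketch, not the framework from the talk; the Slack-like payload and field names are made up), Pydantic can validate an integration's payload while Typer exposes it as a command-line tool:

import typer
from pydantic import BaseModel, ValidationError

app = typer.Typer()

class SlackMessage(BaseModel):
    # Hypothetical payload for a 3rd-party integration
    channel: str
    text: str

@app.command()
def send(channel: str, text: str):
    try:
        msg = SlackMessage(channel=channel, text=text)
    except ValidationError as err:
        typer.echo(err)
        raise typer.Exit(code=1)
    typer.echo(f"would deliver to {msg.channel}: {msg.text}")

if __name__ == "__main__":
    app()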
Python gives developers multiple tools and best practices to avoid common security issues and vulnerabilities. However, real life requirements, obstacles and deadlines can sometimes cause good developers to produce insecure code that is vulnerable to common OWASP top 10 attacks like Authorization Bypass, SQL Injection and Cross Site Scripting (XSS).
This presentation shows examples based on real-life vulnerabilities we encounter at CYE in our everyday penetration testing of our clients, with vulnerable code examples and mitigations.
Presentation outline:
Attacks and threats - OWASP Top 10
* Parameter Tampering
* SQL Injections
* XSS\PXSS
* Malicious File Upload
Mitigations
* Parameterized Queries and ORM
* Hardening: Authorizations + Views
* Authorization and permission checks
* Input Validation
* Output HTML Encoding
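For example, the first mitigation above, parameterized queries, looks like this with the standard library's sqlite3 driver (a generic sketch, not code from the talk):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"

# Vulnerable: user input is concatenated straight into the SQL statement
rows = conn.execute("SELECT * FROM users WHERE name = '" + user_input + "'").fetchall()
print(len(rows))  # the injected OR clause matches every row in the table

# Mitigated: the driver passes the value as a bound parameter, never as SQL
rows = conn.execute("SELECT * FROM users WHERE name = ?", (user_input,)).fetchall()
print(len(rows))  # 0 - no user is literally named "alice' OR '1'='1"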
When long-running jobs become too-long-running jobs, profilers help us understand where it is that our code spends its time. I present a technique for manually guided profiling, for cases where the automatic tools can't help.
Automatic profiling is great. You just run your code, as you normally do, and get a nice graph of where your CPU spends its time while you're waiting for a job to finish.
Except, sometimes the automatic tools can't help. Maybe something in your workload doesn't agree with them. Maybe they make your already long-running job run so much longer that it's impractical to run it properly. Maybe you're only interested in profiling a small part of your code, and profiling the whole thing would create too much noise to be useful.
In this lecture I'll go over what I did when faced with such a problem. I'll detail the technique I used to determine where the time is spent. This is a manually guided profiling, i.e. the programmer decides which areas to measure.
We'll also handle the more complicated cases. In particular:
* Short functions that get called a lot.
* Preventing double accounting when one measured function calls another measured function.
* How to present your data when you need to "sell" the need to fix a problem.
Last, but not least, I'll present an easy way for you to incorporate this technique into your own Python code.
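As a flavour of what manually guided profiling can look like (a simplified sketch, not the exact code from the talk), a small context manager can accumulate time per labelled region:

import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)
calls = defaultdict(int)

@contextmanager
def measure(label):
    # Wrap only the regions you suspect; everything else stays unprofiled
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[label] += time.perf_counter() - start
        calls[label] += 1

def job():
    with measure("parse"):
        time.sleep(0.05)
    with measure("compute"):
        time.sleep(0.2)

for _ in range(3):
    job()

for label in totals:
    print(f"{label}: {totals[label]:.3f}s over {calls[label]} calls")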
Running millions of tasks is difficult when dealing with high workload variance. We improved our pipeline efficiency by using ML models to classify task requirements and dynamically allocate the necessary system resources.
To process our customers' data, Singular's data pipeline fetches and enriches data from dozens of different data sources multiple times a day.
The pipeline consists of hundreds of thousands of daily tasks, each with a different processing time and resource requirements, depending on the customer's size and business needs. We deal with this scale by using Celery and Kubernetes as our tasks infrastructure. This lets us allocate dedicated workers and queues to each type of task based on its requirements.
Originally, our task requirements, required workers, and resources were all configured manually. As our customer base grew, we noticed that heavier and longer tasks were grabbing all the resources and causing unacceptable queues in our pipeline. Moreover, some of the heavier tasks required significantly more memory, leading to Out-Of-Memory kills and infrastructure issues.
If we could classify tasks by their heaviness and how long they were going to take, we could have segregated tasks in Celery based on their expected duration and memory requirements and thus minimized interruptions to the rest of the pipeline. However, the variance in the size and granularity of the fetched data made it impossible to predict whether a task was about to take one minute or one hour.
Our challenge was: how do we categorize these tasks, accurately and automatically? To solve the issue we implemented a machine-learning model that could learn to predict the expected duration and memory usage of a given task. Using Celery’s advanced task routing capabilities, we could then dynamically configure different task queues based on the model's prediction.
This raised another challenge - how could we use the classified queues in the best way? We could have chosen to once again configure workers statically to consume from each queue. However, we felt this approach would be inadequate at scale. We decided to make use of Kubernetes’ vertical and horizontal autoscaling capabilities to dynamically allocate workers for each classified queue based on its length. This improved our ability to respond to pipeline load automatically, increasing performance and availability. Additionally, we were able to deploy shorter-lived workers on AWS Spot instances, giving us higher performance while lowering cloud costs.
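To illustrate the routing idea (a hedged sketch: the queue names, the predict_duration helper and the broker URL are placeholders, not Singular's actual setup), Celery lets you pick a queue per task at send time:

from celery import Celery

app = Celery("pipeline", broker="amqp://localhost")  # placeholder broker URL

@app.task
def enrich(customer_id):
    ...  # fetch and enrich one customer's data

def predict_duration(customer_id):
    # Placeholder for the ML model's predicted duration in seconds
    return 42.0

def submit(customer_id):
    # Route each task to a queue that matches its predicted heaviness;
    # dedicated workers (autoscaled by Kubernetes) consume each queue.
    queue = "heavy" if predict_duration(customer_id) > 300 else "light"
    enrich.apply_async(args=[customer_id], queue=queue)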
Hello wait you talk see to can’t my! Sounds weird? Detecting abnormal sequences is a common problem. Join my talk to see how this problem involves Bert, Word2vec & Autoencoders (in python), and how you can apply it to information security problems
Dealing with sequences can be challenging: each item has its unique position in the sequence, and there is a correlation between all items and their positions. One of the most common issues when working with sequences is dealing with anomalous sequences that don't fit the regular sequences' structure. Those sequences make no sense, create noise in the data, and interrupt the learning process.
The most common sequences are text sentences. A possible scenario for abnormal text sequences is speech-to-text translation, where irrelevant noise can be transcribed into nonsense sequences; if we want to build a model based on that data, we need to find a way to identify and clean this irrelevant, anomalous data.
Detecting anomalous sequences can also apply to non-text sequences, such as sequences of actions or events. Those scenarios can be related to information security problems. For example, many organizations keep logs of actions performed on internal systems, and detecting suspicious sequences of actions could be crucial for detecting attacks or misuse of those systems.
My proposed solution includes two phases. In the first step, we model the items in the sequence to understand its structure and correlations: we train a word-embedding algorithm, such as BERT or Word2vec, to generate vector embeddings from the sequences. The next step, after creating the sequence embeddings, is detecting the anomalies. The algorithm we used for the anomaly-detection phase is an Autoencoder, which you can train on normal data and then use to detect abnormal events.
This pipeline has some challenges. For example, each sequence has a different length, and both the word-embedding algorithm and the Autoencoder need to be trained to learn the right structure for all possible lengths.
Join my talk to see how I used Python to build this architecture, and learn how you can process your sequences using word-embedding algorithms such as BERT and Word2vec and feed their output into Autoencoders in order to create an anomaly-detection model for detecting suspicious sequences.
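As a compact sketch of the two phases (assuming gensim's Word2Vec for the embeddings and a tiny PyTorch autoencoder; the action sequences, sizes and thresholds are toy placeholders, not the real pipeline):

import random
import numpy as np
import torch
from torch import nn
from gensim.models import Word2Vec

# Toy "normal" sequences: four positions, each drawn from a small set of expected actions
steps = [["login", "signin"], ["read", "browse"], ["write", "edit"], ["logout"]]
normal = [[random.choice(options) for options in steps] for _ in range(300)]

# Phase 1: learn item embeddings, then represent a sequence by concatenating its item vectors
w2v = Word2Vec(normal, vector_size=8, window=2, min_count=1, epochs=50)
def embed(seq):
    return np.concatenate([w2v.wv[w] for w in seq])

X = torch.tensor(np.array([embed(s) for s in normal]), dtype=torch.float32)

# Phase 2: train an Autoencoder on normal data only; anomalies reconstruct poorly
auto = nn.Sequential(nn.Linear(32, 8), nn.ReLU(), nn.Linear(8, 32))
opt = torch.optim.Adam(auto.parameters(), lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    loss = nn.functional.mse_loss(auto(X), X)
    loss.backward()
    opt.step()

def score(seq):
    x = torch.tensor(embed(seq), dtype=torch.float32)
    return nn.functional.mse_loss(auto(x), x).item()

print(score(["login", "read", "write", "logout"]))    # low: looks like the training data
print(score(["logout", "edit", "browse", "signin"]))  # typically much higher: an order never seen in training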
Join us to learn how to use large language models to solve NLP tasks. Via live coding, we'll demonstrate how to use Few Shot Learning together with Multi-armed Bandit, to tackle the boolean question answering task.
In the era of massive language models (LMs), solving NLP tasks can be as easy as specifying a product need to your engineering team: all you need to do is specify your need in a language the LM can understand. One of the main paradigms for today's massive LMs is called Few Shot Learning, where one specifies a set of examples from which the model has to understand the task. This approach can sometimes be as effective as finetuning.
But how do you choose the set of examples to show to the model? Randomly choose them? Try all possible combinations and choose the best one? We propose to formulate this task as a Multi-Armed Bandit problem: there are many possible sets of examples, and we'd like to explore and find the optimal one in an efficient way.
In this session we'll begin with an empty Jupyter Notebook and finish with a complete notebook that tackles the BoolQ task (boolean question answering). This live coding session will be paired with practical advice and insights you can apply to your next NLP task using the Few Shot Learning approach.
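A stripped-down sketch of the bandit framing (the score_prompt function stands in for a real call to the language model on a validation batch; the candidate sets and numbers are purely illustrative):

import random

candidate_sets = [("ex1", "ex2"), ("ex1", "ex3"), ("ex2", "ex3")]  # few-shot example sets ("arms")

def score_prompt(example_set):
    # Placeholder: in practice, build a few-shot prompt from the examples,
    # query the LM on a validation batch and return its accuracy.
    return random.random()

counts = {arm: 0 for arm in candidate_sets}
values = {arm: 0.0 for arm in candidate_sets}

for step in range(200):
    # Epsilon-greedy: mostly exploit the best arm so far, sometimes explore
    if random.random() < 0.1:
        arm = random.choice(candidate_sets)
    else:
        arm = max(candidate_sets, key=lambda a: values[a])
    reward = score_prompt(arm)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # running mean of the reward

best = max(candidate_sets, key=lambda a: values[a])
print("best example set:", best, "estimated score:", round(values[best], 3))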
Rapid development of complex algorithms requires agile management. This talk will demonstrate how we leverage Python's flexibility and the power of DAGs to enable a flexible algorithm development process with high quality and minimal risk at each stage.
Rapid development of complex and varied algorithms requires tight and agile management. This talk will demonstrate the power of DAGs (Directed Acyclic Graphs) to enable a flexible algorithm development process with high quality and minimal risk at each stage in a continuous delivery ecosystem. I will show how Python concepts and libraries are used in tandem with a DAG architecture to manage independent, atomic algorithmic units with unique characteristics, and at the same time to easily create complex flows through flexible combinational integrations of algorithmic building blocks. A Python-based DAG is central in creating a strong yet open and simple infrastructure for our SW architecture, one that supports incremental deliveries for the development of new, high-quality algorithmic solutions according to changing requirements.
The semiconductor industry is a dynamic field experiencing rapid growth, requiring the continuous delivery of algorithmic solutions to diverse challenges. While these solutions are inherently complex, they need to be developed quickly and agilely. Our DAG-based SW architecture is perfect for meeting these needs. The DAG enables us to develop and enrich our application both with new algorithm units (independent Python packages with their unique parameters, tests etc.) and with new algorithmic flows, while improving reuse and throughput without interfering with parallel development. Moreover, the DAG provides the flexibility to update any existing algorithmic unit with new requirements, according to changing customer challenges. This is optimally done through the combination of DAG and Python, which inherently enables algorithmic building blocks to act as independent packages and yet remain accessible to debugging through the applications that consume them. And since Python is object-based, we leverage it within the DAG architecture to enhance building blocks with classes, methods, attributes and more. Each node in the graph activates an independent algorithmic unit and delivers all required inputs and parameters. The DAG combines these units to generate different flows for different cases and requirements. Over time, each graph section can be developed, replaced or updated with minimal effect on the other algorithmic units and flows.
We will learn that:
1. Python concepts and libraries (like Pickle and Dask) are the perfect drivers for a DAG-based SW architecture.
2. DAG as a SW architecture enables a flexible development process with high quality and minimal risk at each stage.
3. DAG as a SW architecture enables the development of a variety of different algorithms that are both independent and collaborative, thereby supporting reuse and saving TPT.
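As a toy illustration of the node-as-algorithmic-unit idea (a sketch using the standard library's graphlib, not the actual architecture described above):

from graphlib import TopologicalSorter

# Each node is an independent "algorithmic unit": a callable receiving the outputs of its dependencies
def load(): return [3, 1, 2]
def sort_step(load): return sorted(load)
def summarize(sort_step): return {"min": sort_step[0], "max": sort_step[-1]}

units = {"load": load, "sort_step": sort_step, "summarize": summarize}
dag = {"load": set(), "sort_step": {"load"}, "summarize": {"sort_step"}}

results = {}
for node in TopologicalSorter(dag).static_order():
    deps = {d: results[d] for d in dag[node]}
    results[node] = units[node](**deps)  # deliver the required inputs to the unit

print(results["summarize"])  # {'min': 1, 'max': 3}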
Wordle is an online word game that has gone viral with millions of daily players world-wide. We will consider strategies based on information theory and reinforcement learning, allowing the creation of agents outperforming most human Wordle players.
Have you seen the posts on social media featuring yellow, green and gray boxes? Yes, that’s Wordle, a simple online word game that has gone viral with millions of daily players world-wide.
Though it is a simple game, naive automatic solutions do not provide a winning strategy. Following the success of machine learning solvers in games like chess and Go, we will dive into Wordle and demonstrate how to program Python agents that outperform most human players.
We will implement a strategy based on information theory and a strategy based on reinforcement learning. We will present a Wordle python package for evaluating our agents, which you can later use for evaluating and comparing your own agent.
Finally, we address the question all players are asking: What is the best starter word?
How we map continents at cm level accuracy from crowd sourced computer vision data using PySpark. A tale of engineering challenges working with python at huge scale in production with a rapidly evolving development effort.
REM group in Mobileye is tasked with the challenge of creating and updating a high definition map at world scale with cm level accuracy of all road geometry and semantic elements to enable fully autonomous driving.
The map is constructed from crowd sourced anonymized data of millions of driving-assistance systems running computer vision processes in consumer vehicles.
This is the tale of the engineering challenges building a python based production solution running cutting edge algorithms efficiently on big data, while supporting a rapid pace development environment.
In this tale we will share how we addressed the need to: - Build maps at huge scale in reasonable time and efficiency - Enable 100+ developers to continuously evolve the technology at a fast pace, run their code on production loads, view and debug their results
We will discuss the challenges of working with PySpark as our major computation engine, and some of the solutions we employed to address these challenges.
A peek into engineering in Mobileye REM group.
Being able to classify images is at the heart of many recommender systems. In this talk, we will share a simple trick to make the task of building an image classifier as easy as building a standard text classifier.
Building a task-specific image classification solution typically requires leveraging Computer Vision transfer-learning techniques. It involves manipulating complex deep learning models, applying non-trivial image preprocessing, and using expensive hardware. But what if you could leverage existing image metadata annotations to classify your images?
In this talk we will share a simple trick to make the task of building an image classifier as easy as building a standard text classifier. This reduction simplifies preprocessing and training and it also dramatically reduces the required hardware & computation time. This reduction is made possible by leveraging ready-made computer vision APIs provided by the public cloud vendors. These APIs extract semantic textual labels from images that in turn can be used to build simple, shallow NLP classifiers.
This simple reduction has helped us deliver fast & cheap Python-based image classification models to production and is widely used in Outbrain products.
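A condensed sketch of the reduction (the label lists would come from a cloud vision API; here they are hard-coded placeholders, and the classifier is a plain scikit-learn pipeline):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1 (not shown): send each image to a vision API and keep its textual labels.
labels_per_image = [
    "dog grass outdoor pet",
    "cat sofa indoor pet",
    "pizza plate food cheese",
    "salad bowl food vegetable",
]
targets = ["animal", "animal", "food", "food"]

# Step 2: the image classification task is now a tiny text classification task.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(labels_per_image, targets)
print(clf.predict(["burger fries food table"]))  # expected: ['food']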
What if you could take your Python coding skills and use it to affect the physical world around you? Circuit Playground boards allow you to do just that, and in this talk you’ll see how you can turn on the lights using the Python you already know.
As developers, we are used to limiting our creations only to the world displayed on screens. This is your opportunity to learn how to go beyond it. In this talk I am going to introduce you to the basics of Adafruit’s Circuit Playground Bluefruit board. This round electronic circuit board can light up multicoloured LEDs, make sounds, detect touch, colour, temperature, sound, and motion (with no soldering or sewing required!). The board connects to your computer with a Micro USB cable, which allows you to save your Python code and run it instantly. During this talk we will explore together a few exciting sample projects along with their code.
Even if you never tinkered with hardware before, or just recently picked up Python, you’ll be able to run the existing sample code and create your own in no time.
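For a taste of how little code is involved (a sketch based on Adafruit's CircuitPython library for the board; it runs on the board itself, not on your laptop, and the colours and timings are arbitrary):

# Saved as code.py on the Circuit Playground's USB drive
import time
from adafruit_circuitplayground import cp

while True:
    if cp.button_a:                   # the physical push-button A on the board
        cp.pixels.fill((0, 255, 0))   # light all ten NeoPixels green
    else:
        cp.pixels.fill((0, 0, 0))     # lights off
    time.sleep(0.05)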
Linters are a great tool that enable developers to create static analysis rules for their code base, and the most popular one in the Python ecosystem is Pylint - and this talk will walk through some of its advanced features
Linters are a great tool that enable developers to create static analysis rules for their code base, and the most popular one in the Python ecosystem is Pylint. While most programmers use pre-built sets of rules baked into their linter of choice, these can also be adapted to custom needs.
Today's linters are highly evolved, making it possible to catch issues with static analysis throughout the development and CI cycles. But they are even more powerful than that, and few developers take advantage of their many advanced features. With Pylint it is quite easy to create custom rules that serve general usage, such as library guidelines and even security (SAST), as well as more customized usage, like maintaining clarity around internal frameworks and enforcing organizational guidelines.
Oftentimes Python is chosen as the language of choice due to its suitability for specific tasks such as data pipelines and systems engineering, while those who code in the language are not always familiar with its underlying fundamentals and patterns. With custom lint rules, you can proactively help your developers write better code in their native IDEs, protect IaC repos through custom lint enforcement on config files, and even have security tools leverage them for manual vulnerability checks. This talk will demonstrate how you can apply all of this to your Python code with Pylint.
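As a taste of what a custom rule looks like (a sketch of a Pylint checker plugin; the "no print()" guideline is a made-up organizational rule, and the exact base-class requirements vary a bit between Pylint versions):

from astroid import nodes
from pylint.checkers import BaseChecker

class NoPrintChecker(BaseChecker):
    name = "no-print"
    msgs = {
        "W9001": (
            "print() used; use the project logger instead",
            "project-no-print",
            "Hypothetical organizational guideline, used here for illustration.",
        ),
    }

    def visit_call(self, node: nodes.Call) -> None:
        # Called by Pylint for every function call in the analysed code
        if isinstance(node.func, nodes.Name) and node.func.name == "print":
            self.add_message("project-no-print", node=node)

def register(linter):
    linter.register_checker(NoPrintChecker(linter))

Saved as a module on the Python path and loaded with pylint --load-plugins, this flags every print() call in the checked code.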
Property based tests are a pragmatic way to write better tests with less work.
In this talk, we’ll introduce property based tests and show how they can help you in real-world use-cases.
Automated tests are great. But they’re not free - we all want tests that are good at protecting us from bugs - but to get that, we need to put a lot of work into them.
Property based testing is a technique that saves us a lot of this work. It uses the computer to generate hundreds, or even thousands of test cases - so we don’t have to. This helps us find bugs sooner and more easily, and have more confidence in our code.
This session will point you in the right direction to start using property based tests in your work.
We will explore the technique through Python’s excellent Hypothesis framework and go over the fundamental concepts, basic usage and tooling.
We’ll also get a feeling for the power and variety of real-world use cases, by creating a test that explores a CRUD web application, finding bugs in edge cases we didn’t know.
We’ll finish with pointers and resources to help you get started.
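A first taste of the technique (a classic minimal example, assuming pytest and the hypothesis package are installed):

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorting_properties(xs):
    # Hypothesis generates hundreds of random (and nasty edge-case) lists for us
    once = sorted(xs)
    assert once == sorted(once)                          # sorting is idempotent
    assert all(a <= b for a, b in zip(once, once[1:]))   # the result is actually ordered
    assert sorted(xs, reverse=True) == once[::-1]        # consistent with reverse sorting

Running pytest on this file executes the test many times with generated inputs and shrinks any failing case to a minimal example.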
This session might give you the tools to get started with Python and GPT-3.
GPT-3 is a powerful tool that can be used for a variety of natural language processing tasks. Python is a popular language for development, and Ronnie will show you how to use the two together to get the most out of GPT-3.
You'll learn how to set up your development environment, how to train and use GPT-3 models, and how to troubleshoot common issues. Ronnie will also share some tips and tricks for getting the most out of GPT-3.
At the end of the talk, you'll have a better understanding of how to develop with GPT-3 using Python, and you'll be able to apply what you've learned to your own projects.
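The basic setup, roughly as it looked around the time of the talk, is only a few lines (a hedged sketch using the openai package's completion API for GPT-3; the model name, prompt and parameters are illustrative, not recommendations from the speaker):

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]  # never hard-code the key

response = openai.Completion.create(
    engine="text-davinci-002",   # a GPT-3 model; pick whichever fits your task and budget
    prompt="Summarize in one sentence: Python is a popular language for development.",
    max_tokens=40,
    temperature=0.2,
)
print(response["choices"][0]["text"].strip())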
Communicating and persisting data (and state!) is at the very core of software engineering. That’s where serialization comes in - but getting it right can be quite the challenge. Here's how to make it less so.
Communicating and persisting data (and state!) is at the very core of software engineering. That’s where serialization comes in: transforming a set of objects into a stream of bytes.
There are endless competing standards out there, roughly divided into textual and binary. For binary, Protobuf by Google is the undisputed ‘king of the hill’ and even spawned its own RPC library - gRPC.
Getting serialization right can be quite the challenge, especially when large data sets and big messages are involved. Two of the key challenges are: The options are endless, but the long-term implications of your choices can be hard to predict. Backwards compatibility is paramount, often making changes time-consuming and costly.
In this talk we are going to discuss the real-world techniques and optimizations learned building a high-performance production debugger. We’ll cover everything from the theory of serialization, through a quick introduction to Protobuf, all the way to the nitty-gritty details of how to squeeze every bit of performance under the most demanding conditions.
A short walk through the challenge of finding the fastest NumPy algorithm/way for solving the 8 Queens puzzle. During this walkthrough I will explain the different solutions, the NumPy APIs I’ve been using and their underlying implementation.
NumPy is a powerful library. Understanding its underlying implementation and its APIs is key for achieving fast code.
The 8 Queens puzzle requires placing eight chess queens on an 8×8 chessboard so that no two queens threaten each other. Thus, a solution requires that no two queens share the same row, column, or diagonal. When solving the puzzle, it is possible to use shortcuts that reduce computational requirements. For example, by applying a simple rule that constrains each queen to a single column (or row). Generating permutations further reduces the possibilities to just 40,320 (that is, 8!), which then can be checked only for diagonal attacks.
The talk will explore 5 different ways to use NumPy to sum all the diagonals in a 2-dimensional array — from brute force, i.e. looping over all diagonals, to more advanced APIs (e.g. as_strided).
During this walkthrough I will explain the different solutions, the NumPy APIs I’ve been using, their underlying implementation, and a few basic NumPy principles like array data structure, broadcasting, fancy indexing and more.
The optimal solution might surprise you.
We will learn that:
- loops in NumPy are a performance enemy.
- There are (almost) always multiple ways to solve a problem in NumPy — you may just need to expand your box of tricks.
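Two of the simpler variants, to set the stage (a sketch of brute-force Python loops versus a vectorized call; the fancier as_strided tricks are left for the talk):

import numpy as np

board = np.arange(64).reshape(8, 8)

# Brute force: pure-Python loops over every element of every diagonal
def diag_sums_loop(a):
    n = a.shape[0]
    sums = []
    for k in range(-(n - 1), n):
        total = 0
        for i in range(n):
            j = i + k
            if 0 <= j < n:
                total += a[i, j]
        sums.append(total)
    return sums

# Vectorized: np.trace sums each offset diagonal in C
# (anti-diagonals work the same way on np.fliplr(board))
def diag_sums_trace(a):
    n = a.shape[0]
    return [int(np.trace(a, offset=k)) for k in range(-(n - 1), n)]

print(diag_sums_loop(board) == diag_sums_trace(board))  # True - same result, much less Python overhead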
>>> 1,1 == 1,1
(1, True, 1)
This bug led me into a rabbit hole of learning the internals of Python's interpreter. This is the story of how Python 3.10's structural pattern matching feature completely changed the way I write code.
In this talk I'll go over the new structural pattern matching (also known as "match-case") featured in Python 3.10 and present real-world examples of when and why it's extremely useful. We will break the misconception that match-case is just another "switch-case" for Python, and discover its real power. Those use cases include (a small sketch follows the list):
- creating interactive CLIs
- creating custom flake8 plugins
- validating custom syntax
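To give a flavour of the first use case above, here is a tiny, made-up interactive-CLI dispatcher written with match-case (Python 3.10+):

def handle(command: str) -> str:
    match command.split():
        case ["help"]:
            return "available commands: help, greet NAME, add X Y"
        case ["greet", name]:
            return f"hello, {name}!"
        case ["add", x, y] if x.isdigit() and y.isdigit():
            return str(int(x) + int(y))
        case _:
            return "unknown command"

print(handle("greet PyCon"))  # hello, PyCon!
print(handle("add 2 3"))      # 5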
We pip install packages all day long, but did you ever consider where they come from?
Let's explore PyPI, the Python Package Index. Topics we will cover:
1. What is PyPI?
2. How are packages uploaded, and by whom?
3. How to protect yourself from various attacks coming from PyPI?
4. Running your own PyPI repositories and mirroring Python packages.
Python is known to be expensive in memory and CPU. However, that does not mean you can't do anything about it. In this talk, we'll learn about Python's memory management, and what you can do today to improve the performance of your Python program.
Python is a dynamically typed and garbage-collected language. These traits naturally mean that Python can be less efficient in terms of its CPU performance and its memory utilization. In this talk, we learn about how Python manages its memory, and what you can do about it to improve the performance of your Python program. More specifically, we'll learn about:
- Variables vs names - why Python is different from other programming languages
- How Python represents objects, and specifically the PyObject struct
- Python's two methods of garbage collecting - reference counting and tracing
- How you can improve the performance of your program
- Two case studies from Instagram (Facebook)
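A tiny demonstration of the two mechanisms mentioned above (reference counting and the cyclic collector), using only the standard library:

import gc
import sys

a = []
print(sys.getrefcount(a))  # 2: the name `a` plus the temporary reference made by the call itself

b = a
print(sys.getrefcount(a))  # 3: one more name now refers to the same list

# Reference counting alone cannot free reference cycles - that is the tracing collector's job
cycle = []
cycle.append(cycle)
del cycle
print(gc.collect() >= 1)   # True: at least one unreachable object was reclaimed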
Signing in to Twitter using Google, or saving files from an app to the cloud are different applications of auth flows. This talk will show how it works and focus on integrating a flask app with identity providers by applying the relevant flow.
Most users prefer to sign in to apps via trusted identity platforms like Google, rather than managing a separate account per app. As a backend developer, you would probably like to add such a login capability to your app, and more than that, let users grant the app permission to use their data. In this talk, you will hear about the standard authentication and authorization flows, and how we can integrate our Flask app with identity providers, starting from the basic terms and ending with the different auth flows. I will focus on matching the right flow to our app architecture and explain how the app gets the user's consent to access their data. Although the integration code will use a specific identity provider, the flow is generic for every standard identity platform.
The Python interpreter plays a critical role in controlling the performance of your code, using a vast variety of optimizations & fast paths for common code patterns and idioms. This talk will walk you through how it can break or worsen performance.
The Python interpreter plays a critical role in controlling the performance of your code, using a vast variety of optimizations & fast paths for common code patterns and idioms. This talk will be a fun interactive session presented through code examples that display an assortment of those optimizations and in which (unexpected) ways they can break, worsening the performance of your Python code; we'll follow by inspecting CPython interpreter's inherent behavior to understand the reason for the breakage. The output and results just might surprise even the most advanced Python coders.
This talk will (partly) be based on some code samples I have contributed to the excellent wtfpython (https://github.com/satwikkansal/wtfpython) focused on Python quirks in general, with a touch of performance. Through these snippets we will learn some performance dos and don'ts, by understanding CPython internals and features under the hood.
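One example of such an optimization breaking (a sketch I believe reflects CPython's behaviour; the sizes and timings are illustrative): CPython can often resize a string in place during s += x when only one reference to it exists, but holding an extra reference defeats the trick and turns the loop quadratic.

import timeit

def build_plain(n):
    s = ""
    for _ in range(n):
        s += "x"          # CPython can often extend `s` in place: only one reference exists
    return s

def build_with_alias(n):
    s = ""
    alias = []
    for _ in range(n):
        s += "x"
        alias.append(s)   # extra references defeat the in-place trick, forcing a copy each iteration
    return s

print(timeit.timeit(lambda: build_plain(10_000), number=20))
print(timeit.timeit(lambda: build_with_alias(10_000), number=20))  # noticeably slower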
What happens when you develop a Python debugger and the latest Python version breaks it? We’ll go through the process of debugging a Python debugger and the methods we used to solve it efficiently.
When a new Python version is released, the great opportunity to add new features to our software comes around. Yet, alongside those features, there’s always an API break which requires us to make undesired changes to our software. Often, the change can be as small and seemingly insignificant as a signature change or sometimes can be as big as shifting from Python 2 to 3.
In rare cases, due to this upgrade, your software would break. You'd then open your favorite debugger and begin debugging until you pinpoint the issue.
But what happens when you yourself develop a debugger… and need to debug your own debugger?
In this talk I will present what I learned from supporting the latest Python (3.10) when creating a debugger. I’ll go through the background of the underlying debugging mechanism in Python, show a real-world example of what happens when an undocumented minor change in CPython interpreter breaks the debugger, and how to successfully find the solution. I’ll share some personal anecdotes of my own journey doing so and the tips and tricks I learned along the way.
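For context on the underlying mechanism (a bare-bones sketch of the hook most pure-Python debuggers are built on, not the speaker's debugger):

import sys

def trace(frame, event, arg):
    # The interpreter calls this hook for call, line, return and exception events
    if event == "line":
        print(f"executing {frame.f_code.co_name}:{frame.f_lineno}")
    return trace  # keep tracing inside this frame

def demo():
    x = 1
    y = x + 1
    return y

sys.settrace(trace)
demo()
sys.settrace(None)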
With the increasing complexity of modern Python applications and the high cost of running them in the cloud, the need for profiling solutions rises. However, current solutions oftentimes fall short.
With the increasing complexity of modern Python applications and the high cost of running them in the cloud, the need for profiling solutions rises. But current solutions oftentimes fall short, and are not equipped for the requirements of many common stacks and environments.
In this talk we will describe how contemporary Python profilers are implemented, discuss and demonstrate the advantages and disadvantages of different implementation approaches in different use cases. This talk will dive into modern profiling with flamegraphs - when and how to use them for the most accurate results, and introduce the Python profilers PyPerf & py-spy and new implementation approaches including PyPerfMap - for a simpler ramp up, alongside more reliable data, that can work on any Linux machine.
We mostly hear about concurrency as a more performant replacement for threads or multi-processing. But the hidden gem of concurrent programming is how sane concurrent code can be, and how easy it is to reason about shared state.
In this talk we'll be building a toy async-await framework of our own. Building this ourselves allows us to grok concurrency from the ground up, and to understand its value when it is given to us as an existing framework.
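As a hint of where we are heading (a heavily simplified sketch that uses plain generators as "coroutines" and a deque as the scheduler; the real session builds something closer to async/await):

from collections import deque

def task(name, steps):
    for i in range(steps):
        print(f"{name} step {i}")
        yield  # hand control back to the loop, like an await point

def run(tasks):
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # run the task until its next yield
        except StopIteration:
            continue               # task finished
        ready.append(current)      # otherwise, reschedule it

run([task("a", 2), task("b", 3)])  # steps from a and b interleave on a single thread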
Always wanted to have your own surveillance state but didn't know where to start?
In this talk we'll cover the first steps on doing face detection and recognition - in Python!
In this talk we'll go over how to do face detection and recognition in Python and some simple applications of it (both for hobby projects and totalitarian regimes).
We'll use some handy Python libraries for processing images and finding all the faces that appear in them - find out who they are, how old they are and whether they're happy or not (the unhappy ones to be sent to re-education - out of scope for this talk).
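One handy option (an assumption on my part; the talk may use different libraries) is the face_recognition package, which wraps dlib:

import face_recognition

known = face_recognition.load_image_file("known_person.jpg")   # placeholder file names
unknown = face_recognition.load_image_file("group_photo.jpg")

# Assumes the reference photo contains exactly one face
known_encoding = face_recognition.face_encodings(known)[0]

# Find every face in the group photo and check which ones match the known person
for location, encoding in zip(
    face_recognition.face_locations(unknown),
    face_recognition.face_encodings(unknown),
):
    match = face_recognition.compare_faces([known_encoding], encoding)[0]
    print(location, "match!" if match else "someone else")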
Python is dynamically typed. While awesome, even simple statements in a single line can cause serious headaches. Running var['python']['3']['11']['a']
produced a TypeError: 'NoneType' .. error - impossible to debug from the traceback, until 3.11.
A Python line can contain many operations, e.g. in NumPy:
result = (a + b) @ ( c + d)
If we get an error, like a broadcast error, the traceback will mention the Python line but won't be able to tell us whether it originated from a + b or c + d.
Python is compiled to bytecode (pyc files), which is later interpreted by the Python virtual machine. One Python line generates multiple bytecode instructions.
We will dive into Python's bytecode to understand how tracebacks work and why they were constrained in how much information they could provide when bugs happen.
These problems are solved in Python 3.11, released a few weeks ago as an alpha version. With it, tracebacks now have markers that show exactly which variable caused the error. This feature reduces the need for a debugging session.
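You can already see the "one line, many operations" point with the standard dis module (a small sketch; the exact opcode names vary between Python versions):

import dis

dis.dis("result = (a + b) @ (c + d)")
# Prints the instruction sequence for that single line: loads of a and b, an add,
# loads of c and d, another add, a matrix-multiply, and finally a store into result.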
I'd like to share the findings from my research, where I looked into Python packages that wrap vulnerable C code and ship vulnerabilities to unaware developers. Attackers aware of such libraries may abuse these components without the developers knowing.
The Python ecosystem has PyPI libraries that bundle C code, but if that code has vulnerabilities, how would we ever find out? How can we fix them? Until today, this was a common hidden problem and risk that we all accepted (mostly unknowingly). Today it changes. My talk will demonstrate an uncharted attack vector in the open source software supply chain - unmanaged code pieces inside our dependencies.
You'll learn about the struggles of managing open source libraries and adopt takeaways from my research findings, where I looked into Python libraries that wrap vulnerable C code, and unknowingly shipped vulnerabilities to unaware and unsuspecting developers. The developers, on their end, may think that they're safe (no reported CVEs) but malicious actors with such knowledge can exploit these seemingly non-vulnerable libraries and compromise systems while flying completely under the radar.
An enriching talk about music theory and analysis using python tools.
Have you ever noticed that, depending on your mood, not every song is suitable for you at a given time? And how about songs that change your mood while you listen to them? For instance, Queen's "Don't Stop Me Now", voted the most feel-good song of all time, seems to make people pretty happy. If you've ever wondered why this happens, this talk is for you!
We will walk through the songs' features that influence our mood, based on acoustic parameters such as tempo and loudness. We will use signal processing tools written in Python to analyze them. We will also learn about compositional and lyrical features which help us decide if a song is "happy" or "sad" (or a mix of both).
After this talk, you will know some basic terms in signal processing and music. You will then understand, from the perspective of a signal processing engineer, which musical features arouse your feelings.
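For instance (a small sketch assuming the librosa library and a local audio file; the file name is a placeholder and the RMS value is only a crude loudness proxy):

import librosa

y, sr = librosa.load("some_song.mp3")                 # any audio file librosa can read

tempo, beats = librosa.beat.beat_track(y=y, sr=sr)    # estimated tempo in BPM, plus beat positions
loudness = librosa.feature.rms(y=y).mean()            # mean RMS energy as a rough loudness estimate

print(f"tempo: {float(tempo):.0f} BPM, mean RMS energy: {float(loudness):.4f}")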
"A plan is what, a schedule is when. It takes both a plan and a schedule to get things done" - Peter Turla.
"A plan is what, a schedule is when. It takes both a plan and a schedule to get things done" - Peter Turla. Aligning operational requirements with the needs of your employees is not an easy task. Managers who need to build a work schedule for their teams need to consider many variables, e.g. employees' availability and personal preferences, the operational needs of the organization, and fairness and inclusion. The organization should both provide its service and run smoothly, and keep the employees engaged and satisfied. This challenge is amplified in the medical setting, where different professional staff members need to attend to different patients, each with their own limitations.
At Antidote Health, a telehealth service, we created an automated process based on Linear Programming methodology and implemented as a Python pipeline.
The project preprocesses the data received from doctors about their availability and encodes it via a CP-SAT model using the Pandas and NumPy libraries. The algorithm then runs and provides an optimization of shifts for each doctor, given several constraints. The results are piped into the backend of our system, where the scheduling of patient-doctor appointments is managed. The developed model assists with and optimizes the end-to-end scheduling process in a fast-paced, dynamic environment.
By automating the scheduling process, we were able to deliver consistent results, make fewer mistakes, reduce costs, maximize our doctors' productivity, and boost overall doctor and patient satisfaction.
You will be inspired to develop a similar solution for your own scheduling needs. Remember: if you want to win in the marketplace, you must first win in the workplace.
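A toy version of the core model (a sketch with OR-Tools' CP-SAT solver; the doctors, shifts and constraints are placeholders, not Antidote Health's real data):

from ortools.sat.python import cp_model

doctors = ["dr_a", "dr_b", "dr_c"]
shifts = ["mon_am", "mon_pm", "tue_am"]
unavailable = {("dr_b", "mon_am")}   # preprocessed availability data

model = cp_model.CpModel()
assign = {(d, s): model.NewBoolVar(f"{d}_{s}") for d in doctors for s in shifts}

for d, s in unavailable:
    model.Add(assign[d, s] == 0)                           # respect availability
for s in shifts:
    model.Add(sum(assign[d, s] for d in doctors) == 1)     # exactly one doctor per shift
for d in doctors:
    model.Add(sum(assign[d, s] for s in shifts) <= 2)      # fairness: cap shifts per doctor

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for (d, s), var in assign.items():
        if solver.Value(var):
            print(d, "->", s)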
Do you ever use data classes in your project? Need to store these data structures for later use? In our talk, we will present how to do it in Python. We will focus on Pydantic and will show the correct way to do it for complex data structures.
Many of us write Python apps that handle data. As part of that, we often face the need of persisting this data as-is. This will enable us to use this data structure later.
In this session, we will present:
1. How does serialization work?
   1. The difficulties of dynamic programming languages with serialized data
   2. A few different Python libraries that handle that (pickle, jsonpickle, pyyaml, marshmallow)
   3. How Pydantic takes the most precise approach
2. The problem and solution of serializing subclasses
   1. The problem when deserializing the raw data
   2. A live session where we walk through increasingly better solutions to find the optimal one
After this session, you will have the understanding and tools to persist your complex data structures using Pydantic.
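A small sketch of the subclass problem itself (written against Pydantic v1; the models are made up):

from pydantic import BaseModel

class Animal(BaseModel):
    name: str

class Dog(Animal):
    good_boy: bool = True

class Household(BaseModel):
    pet: Animal   # the field is declared as the base class

home = Household(pet=Dog(name="Rex"))
raw = home.json()
print(raw)   # the JSON carries the fields, but no record of which concrete class produced them

restored = Household.parse_raw(raw)
print(type(restored.pet).__name__)   # 'Animal', not 'Dog' - the subclass information is lost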
A step-by-step introduction to purchase prediction. Also applicable to survival analysis and churn prediction. Including implementation in PySpark.
When dealing with survival analysis, the model's success is in predicting death correctly. But it can also predict an engine failure, abandonment, or even purchases. In purchase prediction, survival analysis, or churn prediction, the data is usually labeled, or artificially labeled by a set of rules, such as 30 days of inactivity being equivalent to churn. But the data structure is different from classical machine learning, and the data handling and modeling differ accordingly. In this lecture, we will cover the data structures and aggregations for such analysis, focusing on time aggregations using PySpark, and what NLP has got to do with any of it.
Nobody cares about your algorithm, learn how to communicate model insights.
Data scientists can lose track of time working on their models, failing to consider anything other than loss and metrics. But, in reality, no one cares about one's specific model. The ability to communicate the model's functionality, as well as other insights produced by the model, is critical to the data science profession, yet this ability is often overlooked. What if I told you that you could solve this problem and demonstrate your model's capabilities to the rest of the world with a simple Python script?
The talk will include a live demo that will show how, using a few lines of code, you can wrap a machine-learning model into a pure Python web app so others can experiment with it. You will see how to incorporate interactive plotly graphs, which give the user an additional layer of understanding of the problem.
It is said that soft skills are hard currency. In our case, they are backed up with powerful technology. Whether you are a beginner or an experienced data scientist - this talk will level up your visual data communication, and grant you a new superpower!
Formal verification (FV) can prove the correctness of algorithms and systems and so ensure safety. Since FV tools are not easy to use, we will show examples (from the RL domain) of how executing them via Python can be very useful and user-friendly.
Formal verification is a subfield in mathematics/computing that provides methods to specify [1] and formally prove the correctness [2] of algorithms and systems given an environment and described using variables, transition functions, etc.
The ability to check whether code enters an unsafe state, using the information in the system, is a great advantage. But unfortunately, the main model checkers (automatic tools implementing formal verification algorithms) are not very user-friendly. They require writing a text file, which can be very long for large-scale problems, and multiple steps to compile and run; their UI isn't always very accessible either.
Utilizing model checkers such as NuSMV [3] can become easier by using Python. By writing code that generates the text file for us and running NuSMV as a subprocess, we can get the NuSMV results, disassemble the answer with Python, and use it in our code accordingly.
Moreover, if we change the problem parameters or the scale of the problem between runs, we need to change the NuSMV file accordingly, which makes the transition between runs much more complicated. By using Python we can make this process much easier and more modular: we can use the information in our code to write the file and also run it.
To demonstrate the idea, we will focus on reinforcement learning (a subfield of machine learning). We trained an agent to play Frozen Lake, a game in which the player needs to get from one point to another on a board without falling into certain squares which are water holes. We use Q-learning [4] (a reinforcement learning algorithm) for the agent, which is trained until the decision table converges, i.e. changes by no more than some bound epsilon. We aim to find a way to improve the learning process using formal verification, and specifically NuSMV.
We used NuSMV as an expert for reinforcement learning: NuSMV checks whether the problem we created is solvable before we run the agent, and during the training process we used it as an expert that helps the agent learn and makes sure it can find an optimal route to the goal.
Before using it as an expert that gives the agent the solution, we used formal verification and Python together. For example, some problems require parameter tuning, so we used NuSMV to check whether the convergence epsilon is good enough to reach the optimal solution; this has the potential to save time because we might be able to take a much larger epsilon.
Furthermore, NuSMV can tell us whether the agent has enough information during the training process to reach the solution, or whether it needs more training. These principles are beneficial to other RL problems.
To conclude, formal verification can be helpful for many problems, but using it may be hard for large-scale or complicated problems. Python is a powerful tool that can help us make the process easier and more manageable, and allow future integration with new problems and tools.
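The glue itself is plain Python (a simplified sketch: the generated model text is a trivial stand-in for the real frozen-lake encoding, and it assumes a NuSMV binary is available on the PATH):

import subprocess
import tempfile

def build_model(grid_size):
    # In practice this would encode the board, the agent's moves and the
    # safety/reachability specification for the given problem parameters.
    return f"""MODULE main
VAR
  pos : 0..{grid_size - 1};
ASSIGN
  init(pos) := 0;
SPEC AG (pos < {grid_size})
"""

with tempfile.NamedTemporaryFile("w", suffix=".smv", delete=False) as f:
    f.write(build_model(grid_size=16))
    model_path = f.name

result = subprocess.run(["NuSMV", model_path], capture_output=True, text=True)
print("specification holds" if "is true" in result.stdout else "inspect the counterexample")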
[1] Manna, Z., Pnueli, A.: The Temporal Logic of Reactive and Concurrent Systems - Specification. Springer, New York (1992). [2] Clarke, E. M., Henzinger, T. A., Veith, H., & Bloem, R. (Eds.): Handbook of model checking (Vol. 10). Cham: Springer (2018). [3] Cimatti, A., Clarke, E., Giunchiglia, F., Roveri, M.: Nusmv: a new symbolic model checker. Int. J. Softw. Tools Technol. Transfer 2(4), 410–425 (2000). [4] Watkins, C. J., & Dayan, P.: Q-learning. Machine learning, 8(3), 279-292 (1992).
Transformer-based models have been producing superior results on various NLP tasks. In this talk, we’ll cover the new transformer-based NLP pipelines featured by SpaCy, and how to apply multi-task learning for improved efficiency and accuracy.
The transformer deep learning model adopts the mechanism of self-attention, and serves as a swiss army knife solution to many of the most common language tasks. Combining several language tasks together is commonly referred to as language processing pipeline, and was made popular across the NLP industry using SpaCy, a Python library for language understanding. In this talk, we’ll cover the undoubtedly most exciting feature of SpaCy v3: the integration of transformer-based pipelines. We’ll learn how to quickly get started with pre-trained transformer models, how to create custom pipeline components, and how to perform multi-task learning and share a single transformer between multiple components. Altogether, we’ll explore the building of efficient NLP pipelines for real-world applications reaching state-of-the-art performance.
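Getting started is only a few lines once a pre-trained transformer pipeline has been downloaded (a minimal sketch, assuming python -m spacy download en_core_web_trf has been run):

import spacy

nlp = spacy.load("en_core_web_trf")   # spaCy v3 transformer-based English pipeline
doc = nlp("Guido van Rossum created Python in the early 1990s.")

for ent in doc.ents:
    print(ent.text, ent.label_)       # named entities from the transformer-backed NER component

print(doc[0].pos_, doc[0].dep_)       # the same shared transformer also feeds the tagger and parser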
Murphy’s Law states that if anything can go wrong it will -- and this is particularly true in data science. Based on personal experience, I describe how to create an effective model despite data pitfalls, methodological hazards and hidden bugs.
“If anything can go wrong, it will”, states Murphy’s Law, and this holds particularly true in data science. Whereas the algorithms used in data science are mathematically flawless, the path to creating an effective model is often rife with obstacles, such as data deficiencies, methodological pitfalls, hidden bugs and human mistakes. In this talk, I offer lessons about common obstacles and how you can avoid them in your projects, as I have learnt from nearly a decade of data science work. Topics include: how data often violates our intuitive assumptions, and how Python tools can detect such violations. How to confidently build a model by starting from simple baselines and using synthetic data to your advantage. How to avoid common pitfalls – such as unintentional overfitting, train set contamination and inconsistent package versions – via defensive programming and code reviews. And how to guarantee long-term code correctness via Pytest.
In this talk, you will learn about the concept and benefits of tracing by examining the open-source project OpenTelemetry. You will leave this session knowing how to set up OpenTelemetry to get better visibility and troubleshoot your system faster.
Tracing and observability are becoming very popular as distributed services are getting more complex. To better understand our architecture and to be able to troubleshoot production issues faster, we need to track how requests are populated throughout the system. By monitoring the interactions between the different components we can overcome some of the native complexity of distributed services. In this talk, you will learn about the concept and benefits of tracing by examining the open-source project OpenTelemetry in its Python version. You will leave this session knowing how to set up OpenTelemetry yourself and utilize open-source solutions and tracing data to get better visibility and troubleshoot your system faster.
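A minimal, self-contained setup looks roughly like this (a sketch using the opentelemetry-sdk package that just prints spans to the console; real deployments swap in an exporter that ships spans to a backend):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("user.id", "42")
    with tracer.start_as_current_span("query-database"):
        pass  # nested span: the parent/child relationship is recorded automatically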
Using Python's awesome features and important design principles for safe legacy code refactoring while maintaining a healthy production environment
Throughout the company life cycle many products change and evolve. Your PM is demanding an urgent feature, and you find yourself looking at a patchwork project composed of fixes on fixes, no testing, and an "if it's working, don't touch it" comment written by someone who left the company a few years ago.
We all know the experience of inheriting a legacy system and with it the curse of tech debt.
In this talk I will share the insights I gained lifting many such curses, while maintaining a running production environment, using the magic of the Open-Closed principle - giving you the right tools to plan and execute your next refactoring challenge.
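As a reminder of the principle itself (a generic sketch, not the production code from the talk): new behaviour is added by writing new classes, not by editing the battle-tested old ones.

import json

class ExportFormat:
    def export(self, rows):
        raise NotImplementedError

class CsvExport(ExportFormat):
    def export(self, rows):
        return "\n".join(",".join(map(str, r)) for r in rows)

# Extension: a new requirement arrives, so we add a class - the existing classes stay untouched
class JsonExport(ExportFormat):
    def export(self, rows):
        return json.dumps(rows)

def run_report(rows, fmt: ExportFormat):
    # Closed for modification: run_report never changes when new formats appear
    return fmt.export(rows)

print(run_report([[1, 2], [3, 4]], CsvExport()))
print(run_report([[1, 2], [3, 4]], JsonExport()))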
Securing Infrastructure-as-code configurations is a key requirement in a cloud production system. We will cover how the networkx library is leveraged to represent cloud resources as a DAG, and how it enhances the misconfigurations scanning process.
We at Bridgecrew by Palo Alto Networks have published checkov, a Python open source tool which is the industry standard for Infrastructure-as-Code (aka IaC) scanning, used by thousands of users and backed by an active community. Checkov scans for misconfigurations and reports the security implications and risks that such misconfigurations induce. In this talk, I will show how we leveraged the networkx Python library to empower checkov's scanning capabilities.
IaC is the practice of codifying the provisioning and management of IT resources. As IaC frameworks have become more advanced, dependencies between IaC configurations were incorporated, just as physical IT resources are most likely to be dependent on each other.
Checkov utilizes the networkx graph library to build a DAG (directed acyclic graph) to enable blazing fast graph analysis queries relevant for the IaC security domain. It uses the graph to render configuration settings which stem from other configurations that are linked to it. This allows the testing of more complex scenarios, in which a combination of dependent resources creates an unsecure misconfiguration, whereas scanning each resource independently would not unveil such misconfiguration.
After this lecture you’ll know what a DAG is, how to traverse it and the benefits of in-memory analysis over utilizing a persistent graph DB. You’ll also get to know the concept of IaC, and what benefits it brings to the development lifecycle, and why using checkov ensures your IaC remains continuously secure.
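A toy version of the graph idea (a sketch using networkx directly; the resource names and attributes are made up and are not checkov's real data model):

import networkx as nx

graph = nx.DiGraph()
# Nodes are IaC resources; an edge points from a resource to a resource it depends on
graph.add_node("aws_s3_bucket.logs", encrypted=False)
graph.add_node("aws_kms_key.main", rotation=True)
graph.add_edge("aws_s3_bucket.logs", "aws_kms_key.main")

# A "graph check": a bucket only passes if it is encrypted AND linked to a KMS key
for node, attrs in graph.nodes(data=True):
    if node.startswith("aws_s3_bucket"):
        has_key = any(dep.startswith("aws_kms_key") for dep in graph.successors(node))
        compliant = attrs.get("encrypted", False) and has_key
        print(node, "PASS" if compliant else "FAIL")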
In this talk we'll discuss the finer points of working with JSON. We'll cover custom serialization, validation, and shine some lights at some darker corners.
Why does the following code print False?
outgoing = (1, 2, 3)
data = json.dumps(outgoing)
incoming = json.loads(data)
print(outgoing == incoming)
In this talk, we'll cover more advanced topics in JSON serialization. We'll look into serialization in general, look at some design decisions the Python JSON module does and look into custom serialization and streaming.
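(The snippet above prints False because JSON has no tuple type: the tuple is written out as a JSON array and comes back as a list.) As a preview of the custom-serialization part, here is the standard hook for types the json module does not know about (a small sketch; the date-tagging convention is my own, not the talk's):

import json
from datetime import date

class DateAwareEncoder(json.JSONEncoder):
    def default(self, obj):
        # Called only for objects the default encoder cannot handle
        if isinstance(obj, date):
            return {"__date__": obj.isoformat()}
        return super().default(obj)

def decode_dates(d):
    if "__date__" in d:
        return date.fromisoformat(d["__date__"])
    return d

data = {"event": "meetup", "when": date(2022, 1, 1)}
raw = json.dumps(data, cls=DateAwareEncoder)
print(json.loads(raw, object_hook=decode_dates) == data)  # True: the date survives the round trip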
Concurrency in web applications is hard to identify and debug, and very easy to get wrong! In this talk I'm going to present common concurrency issues and suggest ways to identify and prevent them!
In this talk I'll present common issues, and different approaches for dealing with concurrency:
- Why it is better to ask for forgiveness instead of permission when dealing with concurrency
- What is the TOCTOU (time-of-check-time-of-use) problem, how it can happen and how to address it
- How database locking can help with concurrency, and how to avoid it!
- How and when to rely on database features to maintain correctness
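To make the last two points concrete (a sketch using the Django ORM; the Account model and its fields are hypothetical, and this is not necessarily the approach presented in the talk):

from django.db import transaction
from myapp.models import Account   # hypothetical model

def withdraw(account_id, amount):
    # Check-then-act without a lock is a TOCTOU bug: two requests can both pass the check.
    # select_for_update() makes the database hold a row lock until the transaction commits.
    with transaction.atomic():
        account = Account.objects.select_for_update().get(pk=account_id)
        if account.balance < amount:
            raise ValueError("insufficient funds")
        account.balance -= amount
        account.save()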
Arduino microcontrollers can be used for numerous and versatile home-based functions. Want to learn what you can do and how to get started coding Arduino in Python? There’s a low cost of entry, and the possibilities are endless.
Python is not generally considered the best option for programming microcontrollers. It is big, slow, and not optimized for handling asynchronous events. So how can you enjoy the fantastic world of microcontrollers using Python? In this talk, I will share my journey and the transition I made moving from C to Python, using Arduino programming to improve my Pythonic skills.
There are so many possibilities latent in programming Arduino using Python. I will begin by explaining the basics of Arduino and Micro-Python and showing an Arduino project that I created and other open-source Python projects written for Arduino.
This talk aims to inspire Python developers from all levels to dive into the magical world of microcontrollers and get their hands dirty.
Are you curious about how you can enjoy the benefits of your Pythonic skills in the physical world? Would you like to learn how developing an Arduino project can help you improve your Pythonic skills? If you answered yes - this talk is for you! If you know Python – the microcontroller's world is open to you.
Taking Django's traditional groups and permissions to the next level by adding a role layer and using an access-endpoint pattern approach to provide scalability, flexibility and wider control over authenticated users' access.
Django comes with a built-in permissions system with view, add, change, and delete permissions, as well as the ability to add other permissions as one might wish (even if with weird and uninformative names that no one will eventually use). One question remains, though: is it the right approach when building a big Django application that contains a lot of different authorization groups? This talk is more of a case study dedicated to the challenges (and solutions) we had at Bluevine around this exact question. At Bluevine we faced many difficulties enforcing authorization with the traditional Django authorization system, for several reasons:
* Incompatibilities and different users with different sets of permissions, due to a lack of restrictions around the Django auth user model.
* Inability to grant limited permissions to different models.
The above are just a few of these difficulties. In this talk, we'll discuss the above and other difficulties, as well as our solutions. Hint - we refreshed the Django permission mechanism by adding a role layer, changing permissions, and more.
Python remains a very popular programming and scripting language in the DevOps ecosystem for building CI/CD pipelines. In the same way we think about how we design and build our Python applications, we need to design, build and automate security.
The minimum viable security (MVS) approach enables us to easily bake security into our config files, apps, and CI/CD processes with a few simple controls built for Python applications.
In this talk we will focus on five critical security controls that will be integrated as part of the CI/CD pipeline: Bandit for static application security (SAST), Gitleaks to detect hard-coded or insufficiently secured secrets, Python dependency checks (SCA), infrastructure as code (IaC) and ZAP for API and dynamic application security (DAST), in addition to custom controls to ensure proper enforcement of MFA via Github Security. These controls will provide a foundational framework for securing Python applications, from the first line of code, that will make it possible to continuously iterate and evolve our security maturity, for advanced layers of security that often comes with time, as well as increased experience.
Code examples will be showcased as part of this session.
Startups choose python because it helps you to set up your application very quickly, but after a few months, it can get really messy. This is the story of how we in Databand managed to extract good backend guidelines with Python out of our monolith.
In this lecture, I'm going to talk about our experience in Databand.ai where we started with a quick-and-dirty web application architecture in order to quickly provide value to our customers.
Our application became larger and larger, and the business needs changed significantly. A lot of new code and features were introduced, and we found ourselves chasing our tails trying to detect problems in our app. When we had had enough, we started to put together new backend guidelines and embraced a lot of DDD methodologies, such as bounded contexts, layering, unit of work, the repository pattern, domain architecture, and so on.
I believe this tells the story of a lot of startups and there is a lot to learn from our experience - what went wrong, how to avoid it, and how to fix old problems.