PyCon Israel 2021
Schedule
Beyond basic algorithmic considerations when writing our code, you would be surprised how easy it is to get a more-than-100X increase in efficiency with less than 30 minutes of work, without even improving the time complexity.
When operating on big arrays we often fall into old habits of code writing, be it using pandas, numpy or vanilla Python. While these habits may optimize the speed at which we write code, they often fall short of optimal run time. Even saving milliseconds of run time per task can accumulate to staggering amounts. Sometimes, despite very similar syntax between functions and packages, there is a huge difference in performance, since the internal workings of pandas, numpy and Python vary, as each balances the overhead (or "init") cost and the marginal cost differently. We will explore common and run-time-costly pitfalls when using pandas and numpy, and we will see when it is more efficient to use vanilla Python than these packages. I will introduce a profiling method and a timing method; working with both together can help us detect the weakest points in our code and quickly test different options for improving them. One of the main points of the talk is how to come up with many code variations and test them quickly to find the best solution. I will present many often-neglected functions, from these packages or native, and experiment to see when each is more efficient: e.g. using a pandas index vs a dict with itemgetter; numpy, pandas or pure-Python isin methods; apply vs map; concatenating and appending data to arrays; and many more. In addition, we will learn some useful and surprising efficiency tricks and data structures, like sparse matrices, a numpy array replacing a dict, clever uses of memoization and more.
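As a hedged illustration of the timing method (my own toy example, not code from the talk), the standard library's timeit makes this kind of quick head-to-head comparison easy — here, fetching many keys from a dict with a comprehension versus operator.itemgetter:

```python
import timeit
from operator import itemgetter

data = {i: i * 2 for i in range(10_000)}
keys = list(range(0, 10_000, 7))

def with_comprehension():
    return [data[k] for k in keys]

def with_itemgetter():
    # itemgetter builds one callable that fetches all keys in a single call
    return itemgetter(*keys)(data)

t_comp = timeit.timeit(with_comprehension, number=1_000)
t_get = timeit.timeit(with_itemgetter, number=1_000)
print(f"comprehension: {t_comp:.3f}s, itemgetter: {t_get:.3f}s")
```

The same pattern — wrap each candidate in a small function and time it — scales to as many code variations as you care to try.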
Software and algorithm teams have different needs; still, Python can become a common language and satisfy both. We will see how Python as a common language boosts our development process.
Can algorithm and software teams talk the same language? For many years at Applied Materials, the answer was: No! The teams have different needs... While the algorithm team preferred to develop their algorithms in MATLAB, which has a rich arsenal of scientific tools, the software team preferred to write their code in C++ in order to gain efficiency. Converting code between the teams was a long and exhausting process. In this lecture I will describe our decision to adopt Python as a common language for both algorithm and software development. I will explain how Python can fulfill the needs of both algorithm development and software standards of design and efficiency. I will describe our joint development process for achieving algorithm and software goals through the language. I will show you the boost this process gave us, which will convince you to choose Python as a common language in your company too!
JupyterLab does not have state management as other commonly used frontend frameworks do. This is needed to create multi-page applications with connected forms and shared data. We solved this by developing a custom solution, which we will present.
We implemented a multi-page Python application on top of JupyterLab in order to utilize Jupyter's data visualization and UI capabilities. Our application requires the various pages to be aware of each other's data and to get updates when it changes. We could not find a simple package for state management in Python like those available for other commonly used front-end frameworks, so we implemented a simple state management package of our own.
Our state has a dictionary interface, which enables the application pages to:
- insert keys with any type of data
- register actions to change these keys
- register to receive updates when a specific key is changed
We will share our state management code and show live code examples of how to use it in python applications.
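Their package is shared in the talk itself; a minimal sketch of the idea (names here are illustrative, not the actual API) — a dict-like state that notifies subscribers on change — might look like:

```python
from collections import defaultdict

class State:
    """A dict-like store that notifies subscribers when a key changes."""

    def __init__(self):
        self._data = {}
        self._subscribers = defaultdict(list)  # key -> list of callbacks

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value
        for callback in self._subscribers[key]:
            callback(value)

    def subscribe(self, key, callback):
        """Call `callback(new_value)` whenever `key` is updated."""
        self._subscribers[key].append(callback)

# One page updates the user; another page reacts to the change.
state = State()
seen = []
state.subscribe("user", seen.append)
state["user"] = "alice"
print(seen)  # ['alice']
```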
In K we had a simple task: build a chatbot. With a lot of logical paths. And loops. And external interrupts. In this talk I will present our fairly exotic solution, that looks like a resumable function, which is persisted across user requests.
Source code for the examples is available on GitHub.
Slides are available here.
I will be presenting the "dialogs framework", a library we built in-house for managing long-running persistent functions. These functions come back to life when they get a message from the user, magically resuming where they left off. I will go over the design process, the Python implementation details, and the current gaps and challenges.
The aim of this talk is to inspire similar projects, show off our framework, and reach out for community inspiration on our open challenges.
As motivation, here is the kind of code we are writing:
@dialog
def greet():
    name = run(prompt("Hi, I'm a bot. What's your name?"))
    location = run(prompt(f"{name} is a beautiful name! Where are you from?"))
    run(send(f"{location}? No kidding! I grew up there!"))
After each call to run, the function terminates and the user request is answered with the next message to display. The execution state is persisted into our database, allowing the follow-up answer to be handled by a different server in the future.
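The framework's internals are the subject of the talk; one common way to implement this kind of suspension (a hedged sketch of mine, not necessarily how the dialogs framework works) is replay: persist each run() result, re-execute the function from the top on every request, and stop at the first call with no recorded answer.

```python
class Suspend(Exception):
    """Raised to stop execution until the user answers."""
    def __init__(self, message):
        self.message = message

class Replayer:
    def __init__(self, history):
        self.history = history   # persisted results of previous run() calls
        self.position = 0

    def run(self, message):
        if self.position < len(self.history):
            result = self.history[self.position]  # replay a recorded answer
            self.position += 1
            return result
        raise Suspend(message)   # no answer yet: ask the user and terminate

def drive(dialog, history):
    """Execute one step; return the next message for the user, or None if done."""
    try:
        dialog(Replayer(history).run)
        return None
    except Suspend as suspended:
        return suspended.message

def greet(run):
    name = run("What's your name?")
    run(f"Nice to meet you, {name}!")

print(drive(greet, []))       # What's your name?
print(drive(greet, ["Bob"]))  # Nice to meet you, Bob!
```

Because the history list is all the state there is, it can be stored in a database and the next request handled by any server.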
Loading Python code from a remote location during runtime opens a new world of opportunities (and challenges).
As we all know, Python is a very versatile language. You can do a lot with Python, from scripting to software development. It can be interpreted and also compiled: most of us know .pyc files, and some of us know .pyd files.
Another thing that can be done with Python is "web import" or "http import": importing a Python module from the web. Using code that comes from a remote source opens up a new domain of possibilities, and a totally new domain of problems, risks and challenges.
At JPMorgan we have the Athena project; it has a huge Python code base, probably the largest in the world. Athena's "special" Python interpreter imports code from an object store and enables us to do many things:
Pros:
- A very fast way to update many services, as you just need to update the "public location".
- You do not have to create "large updates" and push all the changes coming from different teams to production; if one developer is done, he can push his own code to production without waiting for the "release time". Even though the code base is large, updates are very fast.
- Code can be updated after the artifact\container is "sealed".
- The code base\program can be very large, regardless of the size of the artifact.
- Fixes and pushes to production can be very fast.
- New tests can check the code compatibility of both the current and the previous releases (for instance, on git, if you store the tests in the same repo as the code, it is very hard to write a test case, show that it fails on the released version, and prove that it passes on your fix).
Cons:
- Requires investment.
- Loading modules can sometimes be slower than loading from disk.
- Need to find a solution for code inconsistencies.
- Code can be updated after the binaries\container are "sealed" (this might change the behavior).
- Requires good testing in order to be stable.
In my talk I'll discuss this and more.
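The Athena interpreter itself is proprietary, but Python's import machinery makes the general idea easy to sketch. Here is a hedged, simplified example of a meta path finder serving module source from an in-memory store (a real system would fetch from an object store over the network):

```python
import importlib.abc
import importlib.util
import sys

# Stand-in for a remote object store mapping module names to source code.
REMOTE_STORE = {
    "hello_remote": "def greet(name):\n    return f'hello, {name}'\n",
}

class RemoteFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    def find_spec(self, fullname, path, target=None):
        if fullname in REMOTE_STORE:
            return importlib.util.spec_from_loader(fullname, self)
        return None  # let the normal finders handle everything else

    def create_module(self, spec):
        return None  # use Python's default module creation

    def exec_module(self, module):
        source = REMOTE_STORE[module.__name__]  # "download" the source
        exec(compile(source, f"<remote {module.__name__}>", "exec"),
             module.__dict__)

sys.meta_path.insert(0, RemoteFinder())

import hello_remote
print(hello_remote.greet("Athena"))  # hello, Athena
```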
In Python, we normally don't worry about memory usage. But that doesn't mean memory leaks are impossible! In this talk, I'll introduce "weak references" -- how they work, when you would use them, and tricks to get the most out of them.
One of the great things about Python is that it includes garbage collection. You don't have to allocate or free memory; just let the system take care of things on its own! In theory, that means you can never experience memory leaks. But in practice, that's not quite true: There are definitely ways in which you can accidentally hold onto object references, resulting in a memory leak.
Fortunately, Python provides us with "weak references" in the standard library's "weakref" module. In this talk, I'll describe Python's garbage collector, and how we can end up with memory problems despite it. I'll then show you how the "weakref" module can help us -- both on its own, and with the data structures and functionality that the "weakref" module provides.
Even if you don't need weak references, knowing how they work can give you great insight into Python's internals, and how you can take advantage of them in your work.
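The talk's own examples aren't reproduced here, but a minimal demonstration of the core idea using the standard library's weakref module (note this relies on CPython freeing an object as soon as its last strong reference goes away):

```python
import weakref

class Node:
    """A plain object; weak references need a class instance, not e.g. an int."""
    def __init__(self, name):
        self.name = name

node = Node("root")
ref = weakref.ref(node)          # does not keep `node` alive
print(ref().name)                # root

# WeakValueDictionary drops entries once the value is garbage collected,
# making it handy for caches that should not cause leaks.
cache = weakref.WeakValueDictionary()
cache["root"] = node

del node                         # the only strong reference is gone
print(ref())                     # None
print("root" in cache)           # False
```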
Most programmers consider Python a scripting or server-side language, totally unsuitable for UI. At Imubit we decided to use JupyterLab in order to combine Python's powerful server-side abilities with a beautiful UI.
While Jupyter is widely used for big data and data science, we decided to use it to easily develop a streamlined work process for our engineers. With just a small amount of effort, we were able to create beautiful, easy-to-use, user-facing Python applications for non-technical users. We will start by describing JupyterLab extensions, then get a glimpse of some of Python's frontend packages (ipywidgets, panel, ipyaggrid etc.), and learn how to use them and extend them with our own custom logic.
We introduce a very useful tool called "vmn" for auto-incrementing your application's version number in a way that is agnostic to language and architecture. You will learn how to use vmn for your application and how to integrate it into existing CI/CD procedures.
Link: https://github.com/final-israel/vmn
Problem statement: today there is no standard way of increasing an application's version, retrieving it, or going back to a specific version with the exact dependencies it was released with. These are the issues vmn tries to solve. The talk will include different real-world use cases, and we will invite others to collaborate and get involved in the project's development.
Takeaways: attendees will learn how to stamp their current applications with vmn.
The time has come for almost every Python developer to build new applications following the serverless paradigm. This is a 300-level talk describing the most important principles of serverless application architecture.
We shall learn the main benefits of micro-microservices, as well as the main challenges of building this kind of application. As a bonus: some ways to deal with these challenges, and several common serverless architecture patterns.
Agenda:
- Why Serverless? (15 min)
- Limits. Why they exist and how to fit them. (5 min)
- How to orchestrate. (5 min)
Python's warnings are exceptions — but they're also distinct from exceptions, and are both used and trapped differently. In this talk, I'll introduce warnings, how to raise, trap, and redirect them, and show you best practices for their use.
If your code encounters a big problem, then you probably want to raise an exception. But what should your code do if it finds a small problem, one that shouldn't be ignored, but that doesn't merit an exception? Python's answer to this question is warnings.
In this talk, I'll introduce Python's warnings, close cousins to exceptions but still distinct from them. We'll see how you can generate warnings, and what happens when you do. But then we'll dig deeper, looking at how you can filter and redirect warnings, telling Python which types of warnings you want to see, and which you want to hide. We'll also see how you can get truly fancy, turning some warnings into (potentially fatal) exceptions and handling certain types with custom callback functions.
After this talk, you'll be able to take advantage of Python's warning system, letting your users know when something is wrong without having to choose between "print" and a full-blown exception.
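As a small, hedged taste of the machinery the talk covers (standard library only; the function and message are my own invented example):

```python
import warnings

def fetch(url, timeout=None):
    if timeout is None:
        # A small problem worth flagging, but not worth an exception.
        warnings.warn("no timeout set; defaulting to 30s", UserWarning,
                      stacklevel=2)
        timeout = 30
    return f"GET {url} (timeout={timeout})"

# By default the warning goes to stderr and execution simply continues.
fetch("https://example.com")

# Trap warnings programmatically instead of showing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fetch("https://example.com")
    print(caught[0].message)     # no timeout set; defaulting to 30s

# Or escalate: turn this category of warning into a (fatal) exception.
with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)
    try:
        fetch("https://example.com")
    except UserWarning as exc:
        print("escalated:", exc)
```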
When we first developed our system, we picked Celery due to its wide community adoption. When we started scaling our systems, we realized Celery was pulling us back from many different angles. We decided to replace Celery with our own technology.
Back in the day at Intsights, we architected our platform around a distributed task queue. Looking for an available library to support our approach, we met Celery. According to its documentation, Celery is a simple, flexible, and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system. It's a task queue with a focus on real-time processing that also supports task scheduling.
For Intsights, Celery did not live up to its promise. It did not scale and was highly bloated with metrics and communication overhead. Moreover, Celery did not introduce enough thread/process safety to handle problematic workloads that might fail, crash or get stuck on special occasions, such as a stuck GIL due to an infinite regex.
At some point, we realized that our only option was to develop our own solution. We decided to stop chasing Celery bugs and focus on what fits Intsights best. At first, we were inspired a lot by Celery's design: we implemented result backends, timeouts, and pipelines, and we stuck to Celery's terminology to make the migration easier. Later we ditched most of those practices and introduced our own.
Today, we have a highly performant, highly stable, and safe library that supports our use case perfectly. Sergeant is meant to be very simple, very fast, very stable, and safe. Still, many features are missing or were left out: we only support Mongo and Redis as backends, and we do not guarantee consistency of task order and consumption. These compromises let us stay very simple to maintain and focus on stability and performance. The library supports only Python 3.6+ and provides full type annotations and test coverage.
When is the right time to implement security when building an app? In this talk, you will learn how to build a secure, cloud-hosted Python application from scratch, the major attack vectors, and the tools you need to remediate the main risks.
When do you think is the right moment to worry about the security of the application you develop in the cloud? The first time your customer requires it to buy your product, or should you just wait for the first security incident? I strongly believe that it is never too early to think and act towards securing your environment and your product, otherwise security becomes some unmanaged technical debt that just accumulates with every single line of code.
In this talk, you will learn how to build a secure, cloud-hosted Python application from scratch. You will discover the main common threats and attack vectors you need to fight, the tools you need to leverage to remediate the most critical risks, and how to continuously monitor the security of your application and environment.
While many developers struggle with the question "should or shouldn't I use Python annotations?", I will demonstrate how proper usage of Python annotations guides developers to refine the structure of their code.
In "The Clean Architecture" article, Uncle Bob explains the fundamental principles of clean and fine coding. In this lecture, I will show how using Python annotations helps the developer follow those rules. Python annotations are considered by many to be redundant, or merely nice to have; yet I will demonstrate how Python annotations, along with making the code more readable, also reinforce the chosen program structure. I will start the session with a brief introduction to what is considered, by Uncle Bob, to be a clean coding architecture, and continue with practical day-to-day examples.
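To give a flavor of the idea (a hedged example of mine, not necessarily one from the lecture): annotating a boundary with typing.Protocol makes The Clean Architecture's dependency-inversion rule explicit in the code itself:

```python
from typing import Protocol

class UserRepository(Protocol):
    """The use case depends on this abstraction, not on a concrete database."""
    def get_name(self, user_id: int) -> str: ...

def greeting(repo: UserRepository, user_id: int) -> str:
    # The annotation documents (and lets a type checker enforce) the boundary.
    return f"Hello, {repo.get_name(user_id)}!"

class InMemoryUsers:
    # No inheritance needed: structural typing matches the Protocol.
    def __init__(self) -> None:
        self._names = {1: "Ada"}

    def get_name(self, user_id: int) -> str:
        return self._names[user_id]

print(greeting(InMemoryUsers(), 1))  # Hello, Ada!
```

The business logic in `greeting` never names a concrete implementation, so swapping the in-memory repository for a real database touches only the outer layer.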
Programming requires a logical mindset, which can be used to introduce strategy into your daily life. Join me, as we review pythonic best practices, constructs and concepts and see how to take advantage of them both at and away from the keyboard.
Most people think of programming as a technological medium to accomplish a task. While this is definitely true, there is a lot more that you can get out of python than for loops and dictionaries. Thinking like a python developer enables you not only to code better, but also to incorporate programming best practices and strategies in your daily life. It also doesn't matter what type of application or script you are writing, by learning python concepts, constructs and best practices, you will be able to take full advantage of what the language has to offer.
Implementing a Flask realtime web application for production isn't as easy as it seems. Learn how to use Redis Pub/Sub, Nginx, uWSGI, signaling, unix sockets, mule processes, socket.io and more to create a robust realtime app.
socket.io enables real-time, bidirectional, event-based communication between the browser and the server. Ideally, Pythonists running a Flask application would simply use the Flask-SocketIO library; yet Flask alone is not suitable for production and must be hosted by a real web server, thus requiring additional development to enable the usage of socket.io.
Our Framework consists of a uWSGI server running Flask instances and other services behind a Nginx proxy. We will share a full working solution of a framework setup that supports Flask realtime web application in production.
The framework includes Redis Pub/Sub to publish events from any service, a mule service to listen to events, uWSGI signaling to notify all workers, socket.io on a Redis backend to allow lazy-apps, a new Nginx mapping and another http listener in the uWSGI server.
In recent years Mypy has gained widespread adoption, and as it continues to improve and evolve, more and more useful features are being added.
In this talk I'll present some gems in the type system you can use to make your code better and safer!
The Mypy typing system, and the complementary extensions module, include some powerful but lesser-known features such as immutable types, typed dicts, union types and exhaustiveness checking. Using these advanced features, developers can declare more accurate types, get better warnings, produce better code and be more productive.
In the talk I'm going to demonstrate the following:
- Basic type annotations (primitives, collection types, etc.)
- How the syntax evolved from Python 2 comments to new features planned for Python 3.10
- Using TypedDict and dataclasses for readability
- Immutable types (e.g. List / Dict vs Sequence / Mapping), examples and motivation
- Type narrowing, and how it can be used to achieve exhaustiveness checking
To demonstrate these topics I'll follow along an example of a real system, where in each step I present a problem and demonstrate how it can be solved using Mypy.
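As a taste of the exhaustiveness-checking pattern (a hedged sketch of mine, not the talk's example): narrowing a Literal type and handing the leftover value to a NoReturn helper makes Mypy flag any unhandled case:

```python
from typing import Literal, NoReturn

Shape = Literal["circle", "square"]

def assert_never(value: NoReturn) -> NoReturn:
    # Only reachable if a case was missed; Mypy errors at the call site.
    raise AssertionError(f"Unhandled value: {value!r}")

def corners(shape: Shape) -> int:
    if shape == "circle":
        return 0
    elif shape == "square":
        return 4
    else:
        # If a new member is added to Shape but not handled above,
        # Mypy reports that `shape` was not narrowed to NoReturn here.
        assert_never(shape)

print(corners("square"))  # 4
```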
This talk might give you what you need to secure your Python application against the OWASP top 10 vulnerabilities. We'll look at examples, tools and quick tips for a more robust code base.
In this hands-on talk, Ronnie Sheer, Head of R&D at Hiverr (a Team8 startup), walks through real examples of the OWASP Top 10 Web Application Security Risks in Python applications. We will then look at small changes you can introduce to your codebase right away to make it more robust. Finally, you can start leveraging the OWASP Top 10 to create a culture of secure coding. Securing Python applications can be an overwhelming task; leveraging the OWASP Top 10 is a great starting point.
This talk will describe the monorepo codebase architecture, explain why you might want to use it for your Python code, and what kind of tooling you need to work effectively in it.
As organizations and repos grow, we have to choose how to manage codebases in a scalable way. We have two architectural alternatives:
- Multirepo: split the codebase into increasing numbers of small repos, along team or project boundaries.
- Monorepo: Maintain one large repository containing code for many projects and libraries, with multiple teams collaborating across it.
In this talk we'll discuss the pros and cons of monorepos for Python codebases, and the kinds of tooling and processes we can use to make working in a Python monorepo effective.
Developers use the term "mocks" and "mocking" when referring to several different testing practices. The talk will standardize the terminology of Mocks, Stubs, and Fakes: their capabilities, the differences between them, and when to use each one.
We'll start with defining Mocks, Stubs, and Fakes: their capabilities, the differences between them, and how to code them.
Next, I'll show an iterative code example where we mock the same code using the above three methods. Every time we mock the same code differently, we get to test different things.
The talk's summary will be based on the code examples: covering each mock type's strengths and weaknesses, when it is appropriate to use each testing method, and a general wrap-up.
If you want to test your code effectively, this talk is definitely for you!
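Terminology varies across teams, so here is one common reading of the three terms as a hedged example using only unittest.mock and plain classes (the talk's own definitions may differ in detail):

```python
from unittest.mock import Mock

def notify(gateway, user, text):
    gateway.send(user, text)
    return True

# Mock: verify the *interaction* (was send called, with which arguments?).
mock_gateway = Mock()
notify(mock_gateway, "alice", "hi")
mock_gateway.send.assert_called_once_with("alice", "hi")

# Stub: return a canned answer; we only care about the value, not the call.
class StubGateway:
    def send(self, user, text):
        return "queued"

# Fake: a working but simplified implementation with real behavior to inspect.
class FakeGateway:
    def __init__(self):
        self.outbox = []

    def send(self, user, text):
        self.outbox.append((user, text))

fake = FakeGateway()
notify(fake, "bob", "hello")
print(fake.outbox)  # [('bob', 'hello')]
```

Replacing the same `gateway` collaborator three different ways lets the same production code be tested for three different things.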
Modern distributed software doesn't stop at your VPC. Edge deployed software needs realtime communications, updates, and state sync. It needs RPC and PubSub over the web. Lets make it open-source.
In this talk we'll cover the need for over-the-web realtime RPC and PubSub, and why we needed and created it for our OpenPolicyAgent realtime updates layer, alongside:
- The challenges that the implementation faced
- Pros/cons of realtime update channels
- Common use cases (updates, sync, event propagation, distributed computing, authorization, ...)
- Additional awesome Python open source we used in this solution (FastAPI, Tenacity, broadcaster, ...)
- How to use the open-source packages we shared
At Bluevine we use Airflow to drive all our "offline" processing. In this talk, I'll present the challenges and opportunities we found in transitioning from servers running Python scripts with cron to a full-blown Airflow setup.
At Bluevine, we were looking to upgrade our backend processing infrastructure from servers running Python scripts with cron to a more scalable solution that allows for workflows (DAGs) and better observability of the application state. Airflow proved to be a valuable tool, though not without some sharp edges. Some of the points I'll cover are:
- Supporting multiple Python versions
- Event driven DAGs
- Airflow Performance issues and how we circumvented them
- Building Airflow plugins to enhance observability
- Monitoring Airflow using Grafana
- CI for Airflow DAGs (super useful!)
- Patching Airflow scheduler
FastAPI is a modern, high-performance, batteries-included Python web framework that's perfect for building RESTful APIs. It can handle both synchronous and asynchronous requests.
With Python 2 deprecated and the rise of Python 3, a whole new world of features and projects has opened up. FastAPI is one of those projects. Heavily inspired by Flask, FastAPI has a lightweight microframework feel, with support for Flask-like route decorators, which makes moving from Flask to FastAPI easy. It takes advantage of Python type hints for parameter declaration, which enables data validation (utilizing Pydantic, another great Python 3 project) and OpenAPI/Swagger documentation. It's super fast: since async is much more efficient than the traditional synchronous threading model, it can compete with Node and Go on performance. In addition, it uses uvicorn, the async lightning-fast answer to gunicorn.
In this talk I will introduce you to FastAPI, explain why we at Insidepacket chose it as our web framework, and show how to migrate from Flask to FastAPI.
In this talk we'll use a real-life use case to learn how extending GDB with Python can help us to solve bugs, all while digging deep into the internals of Python locks and how they're implemented.
Debugging deadlocks is hard. Debugging a deadlock in production is even harder. This talk will demonstrate how Python's state can be debugged in production using GDB, and how we can easily add it to our debugging toolkit. There's much we can do by extending GDB with Python to better understand the internals of the language, and even to customize it to our own debugging needs. We'll learn about CPython's locks, how they affect us, and how to debug a multithreaded Python process in real-time using GDB.
Handling high cardinality with big data can be challenging. We improved our pipeline speed and stability by understanding which data matters more and creating a smart “Cardinality Protector” to reduce cardinality with minimal effect on the data.
As a marketing analytics platform, Singular handles and ingests billions of user events on a daily basis, along with all the marketing data pertaining to each event: Was an ad clicked? When and where? Which network served that ad? How much did the ad cost? And much more. The data is then aggregated so our customers can use it to make informed decisions in their daily marketing operations.
As our operations scaled, we have experienced cases where the sheer number of events, with the large number of columns saved per event, some of which have high cardinality, slowed down our data ingestion pipeline. It ate up CPU, memory and network resources to the point of affecting the user experience. The burden on the system was exacerbated by click spam: a type of fraud where automated tools simulate millions of ad clicks. Click spam increases the already high load on our pipeline.
Our challenge was to reduce the amount of data we ingest, improve our pipeline speed and stability and provide a better overall user experience. But we couldn't just remove excess rows as all rows are essential -- including ones that represent possibly-fraudulent clicks. Our customers want to measure click spam activity and find out where it originates. However, is it possible to retain the necessary information but still reduce the cardinality of some columns?
This was the starting point for what became the "Cardinality Protector." In-depth research into our data helped us prioritize all the columns and metrics by their importance to customers. We then created smart rules in order to cut out some of the most extreme cardinality with minimal effect on the data.
In this session, we will show how we applied our cardinality protection logic to improve system performance significantly while minimizing the effect on the data. We'll talk about the challenges we ran into, both in terms of prioritization logic and system resources, and unveil some of the cool tricks we used, with Pandas and on-disk sorting/group-by, to apply cardinality protection to large batches of data.
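Singular's actual rules are their own, but the basic move can be sketched in a few lines (an illustrative example of mine, not their logic): keep a column's top-k most frequent values and fold the long tail into a single bucket.

```python
from collections import Counter

def cap_cardinality(values, k, other="__other__"):
    """Keep the k most frequent values; map everything else to one bucket."""
    keep = {value for value, _ in Counter(values).most_common(k)}
    return [value if value in keep else other for value in values]

clicks = ["siteA", "siteA", "siteB", "spam1", "spam2", "spam3", "siteB", "siteA"]
print(cap_cardinality(clicks, k=2))
# ['siteA', 'siteA', 'siteB', '__other__', '__other__', '__other__', 'siteB', 'siteA']
```

Note that no rows are dropped, so aggregate totals — including click-spam volume, which lands in the catch-all bucket — are preserved; only the number of distinct values shrinks.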
An inside look at some of the tools inside Sanic to help build a background task manager.
You are building an API when, inevitably, you realize that certain requests are super slow to respond. It dawns on you that you really need to push some work off to a background process. How should you do it?
We will explore some of the tools that exist inside the Sanic framework that will enable us to do just this. From the simple task to a complex multi-node cluster, we will look at different strategies to determine the most appropriate tool for the job. Think Celery, except entirely within Sanic.
An introduction to Geographic Data, some of its basic concepts and common Python tools for working with it.
At some point in life there might come a time when you need to work with geographic data. Instead of waiting in dread for that day, it's best to be prepared!
In my talk I'll explain a bit about geo-data types, formats and concepts, and existing Python tools to easily work with it - all while working through a fun and quirky sample case that we'll solve together.
I'll present a tiered approach that allows testing microservices quickly and thoroughly. The tests use stateful mocks of other services, and thus allow concise tests as well as simulating outages, subtle timing problems and large datasets.
Microservices are fantastic, but a pain to test in complex interaction scenarios. Unit tests are quick and easy, but don’t cover interactions. System integration tests are a standard way to address complexity, but take a huge effort to maintain and a lot of resources to run. How can we get the best of both worlds?
In this talk, I’ll present a tiered approach that enables writing tests quickly without sacrificing coverage. The tests use stateful mocks of other services mediated by a verification layer. I’ll talk about how this approach allows testing of insidious failure modes, such as failures within dependencies and narrow race conditions, both of which are almost impossible to achieve in integration tests.
CI/CD is critical for rapid software development, requiring advanced monitoring and logging infrastructure. We will present our PyTest integration with Elasticsearch, leading to significant debug reduction time and infra/product health improvements.
We will present our methods of integrating Python with the Elasticsearch database by using PyTest plugins and other advanced PyTest features. The Python + PyTest infrastructure allows us to gather useful data such as test coverage, infrastructure stability monitors, product health and debug information. We will go over the three different data levels that we are using: the CI/CD Infrastructure, test flow, and validation test coverage. In addition, we will share how this data enables us to achieve a faster, more stable CI/CD, leading to more efficient development and release cycles. Our system is based on Python PyTest and open source tools that can be run via cloud provider or local servers.
In the world of malware detection, we need to keep innovating all the time to catch the latest APTs. Let's see how we can do it with recent developments in graph analysis using neural networks.
In the last decade, we have suffered a new epidemic: Advanced Persistent Threats (APTs). It seems like every other week a new kind of malware is born, and the attack vectors are becoming more and more sophisticated, from pinpoint targeting of specific machines to massive infection of every machine in their way. To cope with the large number of incidents happening every day in the "everything is connected" age, assigning a human security researcher to every case is expensive and practically impossible. Although sometimes considered black magic, in recent years we have seen increasing usage of machine learning for malware detection and classification. The suggested solutions, inspired by fields as varied as computer vision and NLP, bring cutting-edge techniques into the cybersecurity field. In this talk, I'll show how to use graphs to represent malware, and how to use graph embeddings and GCNs (Graph Convolutional Networks) to tackle tasks such as malware classification and detection, helping security researchers do their job in a faster and more efficient way.
While most of our online lives revolve around short texts, there's very little information on how to apply NLP techniques on such texts. In this talk, I'll share the lessons we learned and the methodology we developed when dealing with short texts.
“Thanks for all the fish” ; “Happy bday grandma!” ; “Mercedes C-class Cabriolet” . Looks random, right? Well, maybe you know the old saying “one man’s trash is another woman’s treasure”. These texts, while very short, can be a virtual gold mine for many different business use-cases, some of which we tackle daily in our work. When we started working on unsupervised feature generation from very short texts, we started by looking into what’s already been done in the field, and to our surprise the answer was: not a lot. In this talk we’ll share some insights from our experience in dealing with short texts. We’ll start by defining what we mean by "short" in our unique case, why it’s interesting in various domains, where and why advanced out-of-the-box methods failed and finally, provide practical tips for handling short and unusual types of text.
This talk will review some of the most common pitfalls that can cause otherwise perfectly good Pandas code to become too slow for any time-sensitive application, and walk through a set of tips and tricks to avoid them.
Writing performant pandas code is not an easy task. In this talk I will explain how to find the bottlenecks and how to write proper code with computational efficiency and memory optimization in mind.
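A classic instance of the kind of pitfall the talk covers (a hedged illustration of mine, not the talk's material): row-wise apply versus a vectorized column operation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(100_000), "b": np.arange(100_000)})

# Slow: calls a Python function once per row.
slow = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Fast: one vectorized operation executed in C.
fast = df["a"] + df["b"]

assert slow.equals(fast)  # same result, very different run time
```

Timing both with %timeit (or profiling with cProfile) typically shows the vectorized version running orders of magnitude faster on frames this size.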
I plan to discuss three archetypal war stories about fitting in memory. In each of them, I'll describe both the technical challenge and the human biases that needed to be overcome to arrive at sound solutions.
One aspect of handling big data is that typically a problem's dataset does not naively fit into RAM. The three episodes I'd like to discuss: - How to chew through thousands of >1GB JSON files without swallowing them whole. - Choosing the right in-memory format for a sparse shortest-path matrix, when the dense version would be prohibitively big. - Choosing a data-at-rest format for a large dataset without reinventing the wheel.
I'll discuss the problems, their solutions, and the mistakes I made along the way.
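To make the second episode concrete, here is a minimal sketch of keeping a shortest-path computation in memory with a sparse adjacency matrix. This is my own toy illustration using scipy.sparse, not the speaker's actual code; the graph and weights are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

# Hypothetical toy graph: 4 nodes, 3 weighted edges.
# A dense 4x4 matrix is tiny here, but for millions of nodes
# the sparse representation is what keeps the adjacency in RAM.
row = np.array([0, 1, 2])
col = np.array([1, 2, 3])
weight = np.array([1.0, 2.0, 4.0])
graph = csr_matrix((weight, (row, col)), shape=(4, 4))

# Dijkstra over the sparse structure; note the returned distance
# matrix is dense, so for big graphs you would compute it per
# source node (via the `indices` argument) rather than all at once.
dist = shortest_path(graph, method="D", directed=False)
print(dist[0, 3])  # 1 + 2 + 4 = 7.0
```

The design choice worth noticing: the *input* stays sparse, but the *output* of an all-pairs query is dense, which is exactly the kind of trade-off the talk is about.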
In this talk, you will learn how at Diagnostic Robotics we create insights from claims data, a form of administrative data at large scale, which provides a great opportunity for AI in healthcare. You will understand how we use medical code embeddings and deep learning methods to build proactive predictive models that benefit patients and reduce the cost of healthcare. We will also discuss the concept of causal machine learning, its use to emulate randomized controlled trials, and see how it relates to our models.
Genomic sequencing and processing produce many terabytes of data. We'll present how a single-cell processing pipeline requires strong/eventual consistency trade-offs that are different from those of traditional big-data systems.
Immunai runs a complex single-cell RNA sequencing pipeline. The computational-biology and machine-learning tools ecosystem revolves around R and Python. We use cost-effective cloud storage for the large sequencing files, while combining them with strongly consistent metadata. R/Python API users can retrieve the data indexed by any application-defined set of labels/features. We will discuss the trade-offs compared to other big-data platforms like Apache Spark, Elasticsearch, etc.
In this talk, I will briefly cover the theory of property-based testing and then jump into use cases and live examples to demonstrate the hypothesis library and how we used it to generate random examples of plausible edge cases of our AI model.
Over the years, testing has become one of the main focus areas in development teams: a good feature is a well-tested one. In the field of AI this is often a real struggle, since most advanced AI models are ultimately stochastic and we can’t manually define all their possible edge cases. This led us to the hypothesis library, which does a lot of that for you, while you focus on defining the properties and specifications of your system.
The tutorial will introduce two interactive plotting libraries, HoloViews and Panel, and show how they can be used to create static HTML files with interactive graphics.
The HoloViz project provides a set of Python libraries for high-level visualization of complex datasets. They are particularly useful for handling big data and the multi-dimensional data that is common in machine-learning applications. HoloViz technologies support multiple graphical engine backends and integrate seamlessly with flexible development and deployment environments like Jupyter notebooks and modern web browsers. The visualization outputs are interactive, with widgets such as sliders and selection boxes, and hover tools to inspect data, while not requiring any JavaScript, HTML, CSS, or other web-technology expertise. This tutorial will focus on two HoloViz libraries:
HoloViews: a high-level interface providing plots (heat maps, histograms, spikes, etc.) in many spatial and temporal combinations, with or without widgets for selecting along dimensions.
Panel: simple application and dashboard creation, composing images, plots, Markdown, LaTeX, and other elements into one HTML page incorporating interactive tabs and widgets.
During the tutorial an interactive presentation will be constructed to show the attendees how to construct their own interactive poster / presentation.
Sample references:
• HoloViz web site: https://holoviz.org
• HoloViz on GitHub: https://github.com/holoviz/holoviz
• Jacob Barhak, Joshua Schertz, Visualizing Machine Learning of Units of Measure using PyViz, PyData Austin 2019, 6-7 December 2019, Galvanize Austin. Presentation: https://jacob-barhak.github.io/Presentation_PyData_Austin_2019.html Video: https://youtu.be/KS-sRpUvnD0
This talk introduces PyTorch Lightning, outlines its core design philosophy, and provides inline examples of how this philosophy enables more reproducible and production-capable deep learning code.
PyTorch Lightning reduces the engineering boilerplate and resources required to implement state-of-the-art AI. Organizing PyTorch code with Lightning enables seamless training on multiple GPUs, TPUs, and CPUs, as well as the use of difficult-to-implement best practices such as model sharding and 16-bit precision, without any code changes. This talk introduces PyTorch Lightning, outlines its core design philosophy, and provides inline examples of how this philosophy enables more reproducible and production-capable deep learning code, based on the following post: https://opendatascience.com/pytorch-lightning-from-research-to-production-minus-the-boilerplate/
I’ll discuss an interpretation framework that uses the features’ distribution to understand the direction of each feature’s impact. The concept is derived from ideas formulated in Judea Pearl’s analysis of causality in his book “The Book of Why”.
The subject of interpretability becomes very important as models grow more and more complex while humans still need to reason about them. Since we don’t want to be limited by the model’s algorithm (for example, if we want to bag several models), the community offers solutions based on alternative analyses: local assessment, shuffling features, etc.
In this talk, I’ll offer a framework that uses the features’ distribution to understand the direction of each feature’s impact, both at the level of the entire sample and for specific observations. The inner workings of this method are highly intuitive and straightforward, and its concept is derived from ideas formulated in Judea Pearl’s analysis of causality (check out “The Book of Why” for more info).
I’ll present a specific use case of tabular data from Bluevine, and compare its performance to available solutions. I’ll also mention directions for applying a similar method to additional fields.
Nathalie Hauser, Data Science Manager @Bluevine
Join this session to hear about my journey with tree-based classifiers, while tackling the problem of classifying songs into different genres. Learn how XGBoost works and what makes it so popular.
Tree-based models are some of the most common machine learning models used today. It makes sense: the basic concept is easy to grasp and easy to work with. In this talk, we will dive into the concepts behind the names Decision Trees and XGBoost, and discuss their advantages and disadvantages in comparison to other machine learning models. On the music side, we will discover how to extract features from songs and how to use them to differentiate between genres. This talk is intended for anyone with basic familiarity with machine learning who would like to deepen their understanding of tree-based models, classification, and how to apply machine learning to songs.
Most data scientists are focused on predictive (aka supervised) projects, yet the real growth is usually in estimating the effects of actions and optimizing action policies. To this end, I will present causal inference and related packages.
There are three layers of analytics: descriptive (BI), predictive (supervised modeling), and prescriptive. The latter, the less-known one, focuses on answering the most important business questions, for example "what was the effect of giving a discount?" or "what should I do to create the desired effect?". In this talk, we will first discuss which frameworks are used to answer these questions, namely causal inference and reinforcement learning. Then we will deep-dive into causal inference and, in a crash course on causality, discuss why it is important. Last but not least, we will present existing causal-inference open-source packages and their limitations.
We've all heard about huge transformers that cost millions of dollars to train and achieve amazing results. But is there still room for the little guy, with a single GPU and a small budget, to innovate in NLP? Well, have you heard about grounding?
We've all heard about huge transformers (e.g. GPT-3, DALL·E, etc.) that cost millions of dollars to train and achieve amazing results. But is there still room for the little guy, with a single GPU and a small budget, to innovate in NLP?
In this talk we will describe the natural language grounding technique, which takes world context into account and has achieved impressive results.
We will demonstrate how instruction parsing can be done more efficiently with a grounded representation, and we will discuss the similarities with pragmatics (in linguistics).
Label distribution shift is a significant ‘unknown’ our models might encounter when facing the real world once they are deployed. In this talk I will provide practical approaches to help our models be more robust to such ‘unknowns’.
If someone had told me a year ago that we'd be wearing masks when walking outside and that my daughter's longest time off kindergarten wouldn't be two weeks in August, I'd never have believed it! But that's life: things change rapidly, and previously made assumptions might not remain valid. Many of us Data Scientists find ourselves working hard to train a model, deploy it to a live environment, and then realize the real world does not behave as we expected. Our model crashes against a reality that is much different from what it is familiar with. The root cause for this gap is the unexpected changes that impact our domain's population. In this talk I will focus on a specific type of 'unknown' change: a shift in the label distribution. I will not only present how your model can be more agnostic to 'unknown' changes, but also provide practical approaches you can apply to your model.
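One classic practical approach in this family (my own illustration; the talk's exact recipes may differ) is prior-probability adjustment: if only the label distribution p(y) shifts while p(x|y) stays fixed, a trained classifier's predicted probabilities can be reweighted by the ratio of the deployed priors to the training priors. The numbers below are hypothetical.

```python
import numpy as np

def adjust_for_label_shift(probs, train_priors, deployed_priors):
    """Reweight predicted class probabilities under a shifted label prior.

    Assumes p(x|y) is unchanged, so p_new(y|x) is proportional to
    p_old(y|x) * p_new(y) / p_old(y), then renormalized per row.
    """
    w = np.asarray(deployed_priors) / np.asarray(train_priors)
    adjusted = np.asarray(probs) * w
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# The model was trained on balanced classes, but in production
# class 1 turned out to be three times rarer than class 0.
probs = np.array([[0.4, 0.6]])
adjusted = adjust_for_label_shift(probs,
                                  train_priors=[0.5, 0.5],
                                  deployed_priors=[0.75, 0.25])
print(adjusted)  # the prediction flips toward class 0
```

In practice the deployed priors are themselves unknown and must be estimated, which is where the more elaborate label-shift estimation methods come in.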
Neural networks don’t have to be black boxes, if you use creative designs and match the architecture to your specific needs, you can create a network as interpretable as linear regression, but without its linear constraints.
Many researchers use fully connected neural networks as a simple go-to model, without trying to match the architecture to the problem at hand. However, thanks to high-level open-source libraries such as PyTorch, anyone can construct their own neural network architecture to fit the requirements of a specific dataset. By creating a logical architecture, one which models the generation process of our data, we achieve two goals: 1. Better accuracy on both train and test, since the model generalizes better. 2. Interpretability: we can assign coefficients to different parts of the model, in a similar way to linear regression models, while allowing great flexibility in the actual model. Interpretability is important as it can help us understand the limitations and failings of our model, and engineer a better model, or collect more features, to improve on these areas. We will examine a few examples of the limitations of simple fully connected neural networks, as well as other ML algorithms, and see how we can overcome these using architecture concepts anyone can implement in a few minutes using PyTorch.
Test sets are often designed to have a specific composition of cases, with constraints applied to each sub-population. Treating test-set curation as an optimization problem could save precious time and transition us towards a "data as code" paradigm.
Test set preparation is an essential part of any data science project. It is often the case that the test set is not just a random choice of samples, but rather a carefully designed population, with specific limits on the number of cases from each important sub-group. As the constraints get complicated, it often takes a while to get them all just right. In this talk I'll show how to treat test-set curation as a constraint-optimization problem that can be automatically solved using linear programming. I will demonstrate an open-source python library, curation-magic, which elegantly does this for you, and argue that treating test sets as the outcome of such optimization is a desired transition towards a "data as code" paradigm.
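To give a flavor of the idea, here is a minimal curation-as-linear-programming sketch using scipy.optimize.linprog. This is my own toy formulation with made-up groups and caps, not curation-magic's API: each variable x[i] in [0, 1] says whether candidate sample i enters the test set, and each sub-group cap becomes one inequality row.

```python
import numpy as np
from scipy.optimize import linprog

# Toy problem: pick as many of 6 candidate samples as possible,
# with at most 2 from group A and at most 1 from group B.
n = 6
group_a = [0, 1, 2]
group_b = [3, 4, 5]

c = -np.ones(n)              # linprog minimizes, so negate to maximize count
A_ub = np.zeros((2, n))
A_ub[0, group_a] = 1         # sum of x over group A <= 2
A_ub[1, group_b] = 1         # sum of x over group B <= 1
b_ub = [2, 1]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1))
print(int(round(-res.fun)))  # 3 samples can be selected
```

With non-overlapping group caps like these, the LP relaxation already attains an integer-valued optimum; overlapping or more intricate constraints would call for a proper integer-programming solver.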
Python can do so much, including using Python to change Python's own behavior. In this talk, we will see how we can hook over any function in order to create an online “helper”, in the style of Clippy from the old Office software, that suggests better coding practices as you work.
This talk stems from dovpanda, a package I've built. dovpanda is an overlay companion for working with pandas in an analysis environment: it hooks over any pandas method and suggests better ways to write your code. We use sys.modules to replace the original function with a modified version, while keeping track of the originals using context managers. We then use inspect to understand which parameters were sent by the user, so we can employ them to the companion's benefit. Using ast, the companion can also understand information about the runtime, such as whether the function call was used in an assignment or as part of a complex statement.
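The hooking mechanics can be sketched in a few lines. Everything below is a hypothetical stand-in, not dovpanda's actual internals: a toy `demo` module plays the role of pandas, and the advice rule is made up. The moving parts match the description above: a context manager swaps and restores the function, and inspect binds the caller's arguments.

```python
import sys
import types
import inspect
from contextlib import contextmanager

# Hypothetical demo module standing in for pandas.
demo = types.ModuleType("demo")
def concat(frames, axis=0):
    return sum(frames, [])
demo.concat = concat
sys.modules["demo"] = demo

hints = []

@contextmanager
def hooked(module, name, advise):
    original = getattr(module, name)
    sig = inspect.signature(original)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        advise(bound.arguments)           # inspect what the user passed
        return original(*args, **kwargs)  # then run the real function
    setattr(module, name, wrapper)        # swap in the modified version
    try:
        yield
    finally:
        setattr(module, name, original)   # restore the original on exit

def advise(arguments):
    if len(arguments["frames"]) > 10:
        hints.append("many small concats are slow; collect then concat once")

with hooked(demo, "concat", advise):
    demo.concat([[1], [2]] * 6)

print(hints)
```

The ast-based part (detecting whether a call sits inside an assignment) is a separate layer on top of this and is not shown here.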
Python is so wonderful, as it lets you control Python itself. This really feels like superpowers. In this talk I hope to scratch the surface of a few examples for such superpowers.
A feature store is good for feature reuse in machine learning, thereby increasing data science accuracy, velocity, and visibility.
A feature store is a single interface to create, discover, and access features for model training and inference. A holistic feature store solution, containing both storage and transformation layers, would ideally include:
- Ingestion - both from streams and batch jobs
- Serving - low latency single features for inference and high throughput bulk features for training
- Transformation / Aggregation logic
- Discovery - features and how to retrieve them
This session will attempt to demonstrate why a feature store is useful, review current solutions, and provide a number of tips on getting started.
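As a toy mental model of the interface the list above describes (purely illustrative; real solutions such as Feast are far more involved, and all names here are invented):

```python
from collections import defaultdict

class TinyFeatureStore:
    """Toy in-memory feature store: ingest, serve, and discover features."""

    def __init__(self):
        self._features = defaultdict(dict)  # feature name -> {entity: value}
        self._registry = {}                 # feature name -> description

    def register(self, name, description):
        self._registry[name] = description  # discovery metadata

    def ingest(self, name, rows):
        for entity, value in rows:          # batch or stream of (entity, value)
            self._features[name][entity] = value

    def get_online(self, name, entity):
        return self._features[name][entity]  # low-latency single-feature lookup

    def get_training(self, name, entities):
        return [self._features[name][e] for e in entities]  # bulk retrieval

    def discover(self):
        return dict(self._registry)

store = TinyFeatureStore()
store.register("avg_order_value", "average order value per user (hypothetical)")
store.ingest("avg_order_value", [("u1", 42.0), ("u2", 17.5)])
print(store.get_online("avg_order_value", "u1"))            # 42.0
print(store.get_training("avg_order_value", ["u1", "u2"]))  # [42.0, 17.5]
```

What this toy omits, and what production feature stores actually solve, are the transformation/aggregation layer and the consistency between the online and training paths.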
In online advertising, we run a lot of online tests to determine which approach boosts our engagement the most. We'll talk about different ways of online testing from the perspective of a new feature we developed that is based on continuous testing.
Testing different UI components, algorithms, and optimization approaches in an effort to boost engagement is becoming more and more prominent in online applications. In this talk we will introduce a feature that is based on continuous online testing, then go over online testing in general and some methods we can use based on certain constraints of the domain. For instance, we may have limited time to decide which test group we want to use, to avoid having the test itself affect the results; sometimes we are also constrained by deadlines by which we have to conclude testing. In tests like those, we have to balance exploration and exploitation, maximizing the test's payout while still being confident in the outcome. With that in mind, we will explore different methods of running online tests, namely split tests, epsilon-greedy multi-armed bandits, and Thompson sampling, and go over their pros, cons, and applications. After a short demonstration written in Python, we will conclude the talk with the reasoning behind the methods we chose.
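To show what the exploration/exploitation balance looks like in code, here is a minimal Thompson sampling loop for two variants with Beta posteriors. The click rates are hypothetical and the whole thing is a sketch, not the talk's demo.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.1, 0.5]          # hypothetical click rates, unknown to the agent
wins = np.ones(2)                # Beta(1, 1) prior for each arm
losses = np.ones(2)
pulls = np.zeros(2, dtype=int)

for _ in range(500):
    # Draw a plausible rate for each arm from its current posterior...
    samples = rng.beta(wins, losses)
    arm = int(np.argmax(samples))            # ...and play the most promising one
    reward = rng.random() < true_rates[arm]  # simulated user click
    wins[arm] += reward
    losses[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)  # pulls concentrate on the better arm as the posterior sharpens
```

The appeal over a fixed split test is visible in `pulls`: the weaker variant is abandoned automatically once the evidence against it accumulates, instead of receiving half the traffic until a deadline.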
Google Earth Engine (GEE) is a cloud computing platform with a multi-petabyte catalog of satellite imagery and geospatial datasets. It enables analyzing and visualizing changes on the Earth’s surface using a Python API.
Google Earth Engine (GEE) is a cloud computing platform with a multi-petabyte catalog of satellite imagery and geospatial datasets. With the new geemap Python package, GEE users can easily manipulate, analyze, and visualize geospatial big data interactively in a Jupyter-based environment.
The topics covered in this lecture include: (1) a brief introduction to satellite imagery; (2) the Earth Engine Python API and the new geemap Python package; (3) searching the GEE data catalog; (4) displaying GEE datasets; (5) classifying images using machine learning algorithms; (6) finding the greenest place in Israel in terms of amount of vegetation.
In optimization problems speed is important, but unfortunately Python isn't optimized for speed. In this talk I'll show how to use Python and optimize bottleneck functions to be as fast as possible using different libraries and methods.
In this talk I'll present how to optimize the running time of a bottleneck function, progressing from Python lists to CuPy arrays. CuPy is a relatively new library that allows running calculations on the GPU using an API similar to NumPy's.
I'll cover a few optimization techniques, such as vectorized data structures, a priori calculations, and parallel operations. I will also show how to time the function and do simple profiling.
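A minimal sketch of the vectorization and timing steps, using NumPy since CuPy requires a GPU (CuPy deliberately mirrors NumPy's API, so the vectorized line below would look nearly identical with `cupy` arrays). The bottleneck function here is a made-up example, not the one from the talk.

```python
import timeit
import numpy as np

# Hypothetical bottleneck: squared Euclidean distances from one query
# point to many candidate points.
points = np.random.default_rng(0).random((2_000, 3))
query = np.array([0.5, 0.5, 0.5])

def loop_version():
    # Python-level loop: one small numpy operation per row.
    return [sum((p - query) ** 2) for p in points]

def vectorized_version():
    # One fused array expression over the whole dataset.
    return ((points - query) ** 2).sum(axis=1)

# Same numbers, very different run times.
print("loop      :", timeit.timeit(loop_version, number=3))
print("vectorized:", timeit.timeit(vectorized_version, number=3))
```

With CuPy, the same vectorized expression runs on the GPU after an `import cupy as cp` and `cp.asarray(points)`, which is exactly the kind of incremental progression the talk describes.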
This session will focus on one of the hottest topics of the past two years in the data science ecosystem - Automated Exploratory Data Analysis.
Recently Andrew Ng held a conference where his main claim was that we should be more data-centric in our research. He based his doctrine on various studies and examples that showed significant improvement in model performance once the researchers modified the data.
"If 80% of our work is data preparation, then ensuring data quality is the important work of a machine learning team." Andrew Ng
To provide the model with strong foundations, we must explore and process the data professionally and meticulously. It can be a very long and exhausting process. To help you get through this part successfully, the new 'Automated EDA' field has emerged.
In the lecture, we will explore the field of automation in ML and how it corresponds with the variability of the projects. We will examine what can be automated in EDA and explore the latest features of two powerful open-source tools - Pandas Profiling and SweetViz.
The audience will receive a link to the slides and to a Colab notebook with examples of: - Exporting an EDA report using Pandas Profiling and SweetViz. - Exporting an EDA report that compares two data sets. - Exporting an EDA report that compares two categories. - FAQ
Topic Modeling’s objective is to understand and extract the hidden topics from large volumes of text. Using a technique based on Sentence-BERT, we were able to perform the extraction of meaningful topics, and present some evaluation approaches.
Topic modeling is an information retrieval technique for discovering meaningful and interpretable topics in a collection of documents. It allows us to learn something about a set of documents that is too big to read.
In this talk we will cover how we leverage Sentence-BERT using the NLP Python framework sentence-transformers. It provides an easy method for extracting high quality sentence embeddings in a computationally efficient manner, which lays the basis for our topic modeling algorithm.
We will also be addressing the inherent difficulty of evaluating topic models by introducing measuring metrics and visualizations that aid the process of analyzing complex results.
Text analysis in real life can often yield unsatisfactory results due to typos, alternate phrasing, abbreviations and more. In this talk, we'll cover practical and efficient string comparison methods, as well as tackle some commonly faced issues.
A common problem faced by data analysts, data scientists, and many developers who need to analyze and compare data, is that texts are often similar, but not quite identical to one another. This can result from the existence of multiple ways to say the same thing, typos and abbreviations, common yet unindicative words (such as "the") and punctuation, that can all skew the results.
During this talk, I will walk you through several methods to compare inexact texts, using a few different libraries, cover the usages as well as advantages & disadvantages of each method, and tackle some commonly faced issues.
By the end of the talk, you should have a good basis to start comparing texts efficiently and elegantly in your code.
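As a taste of the kind of comparison methods the talk covers, here is a sketch using only the standard library's difflib (the talk itself surveys several libraries; the names and threshold below are my own illustrative choices).

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] based on longest matching blocks (stdlib only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Typos score high despite the strings not being identical...
print(similarity("address", "adress"))     # ~0.92
# ...but abbreviations do not; they need dedicated handling.
print(similarity("New York City", "NYC"))

# A simple dedup pass: keep the first of any near-identical pair.
names = ["Jon Smith", "John Smith", "Jane Doe"]
unique = []
for name in names:
    if all(similarity(name, kept) < 0.9 for kept in unique):
        unique.append(name)
print(unique)  # ['Jon Smith', 'Jane Doe']
```

The 0.9 threshold is arbitrary and domain-dependent, which is precisely one of the "commonly faced issues" worth a talk of its own.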
AutoML is a Python-driven tool we built in the Outbrain Recommendations group. In this talk we'll share the motivation for creating this tool, describe the general architecture, and do a short live demo.
Recently, Outbrain's CTR prediction system was heavily reworked. In this talk, we will share our key enabler in this journey: a Python-based AutoML engine which allows data scientists to perform faster offline research iterations. This tool is a robust and highly parallel search engine built solely in Python. We'll share the motivation for building this tool, go through the general architecture, and showcase some of its capabilities in a live demo.
Immunai has built one of the largest centralized immune single-cell data assets in the world and is using AI with it to expand the boundary of our understanding of core immune biology and how it translates to the clinical setting.
Our ability to interrogate and decipher the immune system has dramatically improved over the last 5 years with major advances in single-cell multiomic technology, both in the wet lab and in silico. Immunai has built one of the largest centralized immune single-cell data assets in the world and is using it to expand the boundary of our understanding of core immune biology and how it translates to the clinical setting. But this massive data asset offers a unique challenge in how to understand individual cell types, patients, diseases and treatments in the context of all the others. Immunai tackles this problem with cutting-edge artificial intelligence coupled tightly with our functional genomics platform, which together identify core biological mechanisms that enable us to develop the next generation of immune system therapeutics.