PyCon Israel 2019
Schedule
A collection of 3 short stories, each one crazier than the last: from tracing and metaprogramming to patching frame objects and rewriting bytecode.
Python is an incredibly powerful language, but just how powerful it is can wrinkle your brain. Three simple enough ideas (ordered class attributes, C-like enums, and methods with an implicit self) are implemented with ever crazier, and ever more powerful, tricks that shouldn't be used under any circumstances (and are therefore completely useless; in fact, you probably shouldn't come).
Seriously though, I'll let the code speak for itself:
@ordered()
class A:
    x = 1
    y = 2
    z = 3
assert A._order == ['x', 'y', 'z'] # In Python 2.7, too!
@cenum()
class A:
    a
    b
    c = 10
    d
assert A.a == 0
assert A.b == 1
assert A.c == 10
assert A.d == 11
@selfless
class A:
    def __init__(x):
        self.x = x
    def f():
        return self.x + 1
a = A(1)
assert a.x == 1
assert a.f() == 2
If you want to find out how this is possible — and learn a few things about tracers, metaclasses, frames and code objects along the way — this talk is for you.
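For contrast, here is a far tamer sketch of the first idea. It is not the technique the talk describes (which also has to work on Python 2.7); it simply assumes Python 3.6+, where the class namespace preserves definition order. Skipping dunder names is an assumption of this sketch.

```python
def ordered():
    """Decorator factory: record class attributes in definition order.

    Assumes Python 3.6+, where the class namespace keeps insertion order.
    """
    def decorate(cls):
        # vars(cls) also holds names such as __module__ and __qualname__;
        # skip anything that looks like a dunder.
        cls._order = [
            name for name in vars(cls)
            if not (name.startswith("__") and name.endswith("__"))
        ]
        return cls
    return decorate


@ordered()
class A:
    x = 1
    y = 2
    z = 3


assert A._order == ["x", "y", "z"]
```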
Most projects reach a point where they become a product, and Python projects are no different. Productionization of a Python project starts with the first file in the repository; this talk will try to walk through the process.
Contrary to the Zen of Python, Python projects don't have one obvious way to reach production, but for a specific use case there should be one preferred way.
This talk will try to cover the main topics in a Python project's life cycle as they relate to productionization:
Base project structure
Environment awareness vs. agnosticism
Packaging formats
Versioning practices (PEP440 vs. SemVer 2.0.0)
CI practices
Public and Private distributions
Dependency managers
Some nit-picking and niche issues
A summary of commercial and open source tooling for automated security scanning of Python code.
When asked "Is it safe?" about our Python code, we can use scanning tools (commercial and open source) to give a quantified answer, both for our own code and for external modules. We will follow regulatory and production use cases, give an overview of the available commercial tools, and review how to use the open source scanning tools.
Amenity Analytics does NLP and machine learning using serverless infrastructure almost exclusively. In this talk we'll cover what we do, how we optimize it, and how we scale analysis to meet demand.
Running NLP (Natural Language Processing) and language-based machine learning models can be time and resource consuming. For us, running new models and new algorithms on data flows consisting of hundreds of thousands of documents is part of the daily workflow. Since each of the documents we analyze can be about 50 pages long, our usual use case involves analyzing 5-10 million pages of fine-print financial text. With our legacy system, which is based on Java 8 code running in Docker containers, one analysis cycle can take between 5 hours and 5 days, depending on the complexity of what we're running. Four months ago we started two major evolutions at Amenity. One was moving our entire ETL (Extract, Transform, Load) process to serverless infrastructure using Python via AWS Lambda, Kinesis streams, and DynamoDB. The other was rewriting our NLP engines in Python and Cython so that they could run easily on AWS Lambda functions. The result is thousands of CPUs waking up within seconds and doing extremely efficient and accurate ETL and NLP work, finishing workloads 10x to 100x faster than before. As a bonus, we've got a more efficient team working in Python in a CI/CD environment, and our cloud costs have dropped. In the talk we'll go through the decision process of moving to Python, Cython, and serverless, how our architecture shifted and evolved over time, and how we plan to solve our next major challenge: a billion articles analyzed in less than an hour!
Descriptors are a lesser-known but powerful Python tool. We'll see what they are and how to use them.
Descriptors are a lesser-known but very powerful tool. In this talk we'll recap how the dot operator (getattr) works in Python, see what descriptors are, and learn how you can use them to implement things like validators and even the built-in properties.
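As a rough illustration of the validator use case mentioned above, a minimal data descriptor might look like the following sketch. The names `Positive` and `Order` are hypothetical and not taken from the talk.

```python
class Positive:
    """Data descriptor that only accepts numbers greater than zero."""

    def __set_name__(self, owner, name):
        # Called once when the owning class is created; remember where
        # the per-instance value will be stored.
        self.storage_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class itself: hand back the descriptor.
            return self
        return getattr(instance, self.storage_name)

    def __set__(self, instance, value):
        if value <= 0:
            raise ValueError("value must be positive")
        setattr(instance, self.storage_name, value)


class Order:
    quantity = Positive()

    def __init__(self, quantity):
        self.quantity = quantity  # routed through Positive.__set__


order = Order(3)
print(order.quantity)  # 3

try:
    Order(0)
except ValueError as error:
    print("rejected:", error)
```

The built-in `property` works through the same `__get__`/`__set__` protocol; the descriptor form just makes the rule reusable across attributes and classes.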
Data classes are very useful, but not often used in Python. This talk will discuss how using data classes in Python improves code readability, usability and robustness, and show how to create them easily with the attrs and dataclasses libraries.
Data classes are a common and useful design pattern, but they are much less popular in modern dynamic languages such as Python and JavaScript. However, as codebases and developer teams grow, the other common alternatives, such as using plain native objects (e.g. lists and dicts), often become a liability.
A major reason for not using data classes is that writing them introduces quite a bit of boilerplate code, which is cumbersome and often considered "unpythonic". Lately, the attrs library has gained popularity as a good way to write data classes with very little boilerplate. Doing so gives additional benefits, such as easily adding validations and user-friendly textual representations.
In the most recent release of Python, version 3.7, a new dataclasses library was added to the standard library. Inspired by attrs, it makes using data classes in Python easier than ever, without even the cost of an added dependency.
After this talk, participants will have a good idea of when data classes should be used, and have all of the knowledge and understanding needed to implement them well.
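To give a feel for how little boilerplate is involved, here is a small sketch using the standard-library `dataclasses` module (Python 3.7+). The `Talk` class and its fields are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Talk:
    title: str
    speaker: str
    minutes: int = 25
    # Mutable defaults need a factory so instances do not share one list.
    tags: List[str] = field(default_factory=list)

    def __post_init__(self):
        # A simple validation hook that runs after the generated __init__.
        if self.minutes <= 0:
            raise ValueError("minutes must be positive")


talk = Talk("Descriptors", "A. Speaker")
print(talk)
# Talk(title='Descriptors', speaker='A. Speaker', minutes=25, tags=[])
print(talk == Talk("Descriptors", "A. Speaker"))  # True
```

The decorator generates `__init__`, `__repr__` and `__eq__` from the annotated fields, which is the boilerplate one would otherwise write by hand.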
logger.info - practical considerations when using loggers
Logs. We all use them, we all need them, but most people know very little about them and pay very little attention to setting them up correctly. It’s common to see people going on about stuff like choosing the right DB or web framework, while choosing the right logging solution, setting up your log handlers and making sure the logs are collected properly is often overlooked. In this talk, I will walk you through the secret lives of loggers and log handlers, and compare different logging solutions and approaches.
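As one small example of what "setting up your log handlers" can mean, the sketch below configures a named logger with a single handler and an explicit format, using only the standard `logging` module. The helper name `build_logger`, the logger name and the format string are assumptions made for illustration.

```python
import io
import logging


def build_logger(name, stream, level=logging.INFO):
    """Return a logger that writes formatted records to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:   # do not stack handlers on repeated calls
        handler = logging.StreamHandler(stream)
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for a file, socket or log collector
log = build_logger("demo.app", buffer)

log.debug("hidden: below the INFO threshold")
log.info("service started on port %d", 8080)

print(buffer.getvalue(), end="")
# INFO demo.app: service started on port 8080
```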
Listing the different concurrency options in Python
In the last few years I have had to run tasks concurrently. In some cases the tasks were running on the same machine; in other cases they were running on different machines. I want to share with you the APIs, patterns and technologies that I found for running these tasks.
In particular I will talk about:
- Python executor objects
- Python asyncio library
- 3rd parties for parallel computing (numpy)
- Dask
My talk will be based on a Jupyter notebook, and I will try to provide as many examples as possible.
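For the first item on the list, a minimal executor example from the standard library might look like this. The `slow_square` function is a made-up stand-in for an I/O-bound task.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_square(n):
    time.sleep(0.05)  # stand-in for waiting on I/O
    return n * n


# The context manager waits for all submitted work and then shuts the pool down.
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in input order, even if tasks finish out of order.
    results = list(pool.map(slow_square, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Swapping in `ProcessPoolExecutor` keeps the same interface but runs the tasks in separate processes, which is usually the better fit for CPU-bound work.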
Extendable and easy-to-use lightweight enumeration alternative.
A presentation of an enumeration implementation that wraps collections.namedtuple. It demonstrates the advantages of the proposal over the standard enum APIs:
- Less verbose
- Compatible with built-in types
- Collects constant objects (e.g., string literals) into a single container
- As above, with conversion, e.g., to pathlib.Path
- Simple processing of nested JSONs: one-stop JSON object creation
- One-stop loader for configuration files
- Automatic generation of sequences, e.g., KB, MB, GB, ...
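The speaker's actual implementation is not shown in the abstract. As a guess at the general idea, a container of constants built on `collections.namedtuple` could be sketched as follows; the helper name `constants` is hypothetical.

```python
from collections import namedtuple


def constants(typename, **members):
    """Bundle named constants into one immutable namedtuple instance."""
    container = namedtuple(typename, members.keys())
    return container(**members)


Size = constants("Size", KB=2 ** 10, MB=2 ** 20, GB=2 ** 30)

print(Size.MB)       # 1048576
print(Size._fields)  # ('KB', 'MB', 'GB')
print(Size.KB * 4)   # 4096, members are plain ints, no .value needed
print(list(Size))    # [1024, 1048576, 1073741824]
```

Unlike `enum.Enum` members, the values here are ordinary built-in objects, which is what "compatible with built-in types" presumably refers to.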
Test code, no different from production code, needs to follow some principles, patterns, and practices to stay usable and stand the test of time and change without rotting and becoming a maintenance burden. Let's talk about making good tests.
Writing system tests is challenging. It requires the tested system to expose a proper API/SDK, a test environment that is able to deploy the system, a testing tool/framework that runs and serves the tests, and lastly the tests themselves. This talk will focus on pytest as the test framework and how tests should be written with it. We will give real-life examples from oVirt System Tests.
How does debugging actually work in Python? What are the differences between the CPython and PyPy interpreters, and what is their underlying debugging mechanism? Learn how to utilize this knowledge at work and up your watercooler talk game.
Knowing your enemies is as important as knowing your friends. Understanding your debugger is a little of both. Have you ever wondered how Python debugging looks on the inside? On our journey to building a Python debugger, we learned a lot about its internals, quirks and more.
During this session, we’ll share how debugging actually works in Python. We’ll discuss the differences between CPython and PyPy interpreters, explain the underlying debugging mechanism and show you how to utilize this knowledge at work and up your watercooler talk game.
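The abstract does not say which mechanism the speakers focus on. One hook that CPython exposes for debuggers is `sys.settrace`, and the sketch below uses it to record which lines of a function ran. The helper `trace_lines` and the sample function are made up for illustration.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args); return its result and the line offsets that executed."""
    code = func.__code__
    seen = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # not our function: do not trace this frame
        if event == "line":
            # Store offsets relative to the 'def' line.
            seen.append(frame.f_lineno - code.co_firstlineno)
        return tracer  # keep receiving events for this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace function
    return result, seen


def absolute(x):
    if x < 0:
        return -x
    return x


print(trace_lines(absolute, -5))
print(trace_lines(absolute, 5))
```

A breakpoint is, roughly, the same callback deciding to pause when it sees a particular file and line number.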
Writing tests is a great way to improve the quality of your application, but in a complex application that depends on various 3rd party APIs it can be quite hard.
Pytest makes it easier.
In this presentation we will take a quick look at what fixtures are and how to use them in pytest.
We'll also look at Mocks, Spies, Stubs, Fakes, and Dummies that are all various types of Test Doubles. We'll see a number of use-cases and implementations.
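To make the vocabulary a little more concrete, here is a small sketch of a stub that doubles as a spy, written with the standard-library `unittest.mock` so that it runs without pytest. In a pytest suite the stub would typically be handed to the test by a fixture. The function `fetch_temperature` and the client interface are hypothetical.

```python
from unittest import mock


def fetch_temperature(client, city):
    """Code under test: depends on a third-party API client."""
    payload = client.get("/weather", params={"city": city})
    return payload["temp_c"]


# Stub: returns a canned answer, so no network is involved.
stub_client = mock.Mock()
stub_client.get.return_value = {"temp_c": 31}

temperature = fetch_temperature(stub_client, "Tel Aviv")
print(temperature)  # 31

# Spy: the same Mock recorded how it was called, and we can assert on it.
stub_client.get.assert_called_once_with("/weather", params={"city": "Tel Aviv"})
```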
The new eBPF technology in the Linux kernel allows us to perform production-safe analytics in real time with minimal impact on the running system. We show how a developer can monitor and detect performance issues on a live production system, explain
With the introduction of eBPF, a safe and fast mini-VM into the Linux kernel, writing a kernel module in C is no longer a requirement for doing many of the things we often want to do in kernel mode, such as gathering and analyzing performance metrics.
In this session we present the eBPF technology and how it can be used with the BPF Compiler Collection (BCC) Python library. Furthermore, we will talk about kernel and user level tracepoints, which can be used to meticulously monitor an application in a safe manner.
We also give a glimpse at some exciting and innovative usages of the eBPF technology. The session will end with an overview of the key advantages and disadvantages of this technology.
In this talk, I want to share experiences of migrating from a standard Django architecture to a GraphQL-based app using the Graphene framework, including keeping consistency in the implementation and shape of the API, common API design issues and pitfalls.
Graphene is currently the most popular framework for building GraphQL APIs in Python, and it's also an obvious choice for many people who decide on adding a GraphQL layer to their Django applications. After using it for over a year we've successfully built an API with about 50 queries and over 100 mutations on top of an existing Django project (Saleor), but we've also learned many hard lessons and discovered shortcomings of the framework that we had to overcome.
In this talk, I'd like to show practical tips on some of the most common problems that a Django developer has to face to build an optimized and maintainable API with Graphene e.g.:
- using useful abstractions to build queries and mutations faster
- optimizing database queries in a graph
- structuring a large Graphene project
- unified error handling
Apart from that, I'd like to bring up a few limitations of the framework that we didn't know about before we used it. Finally, I'd like to conclude by discussing the most important benefits that adopting GraphQL brings to modern web application development, both for backend and frontend.
DataCity is a project aimed at creating a single repository of all municipal data in Israel. I'll talk about the project and the Python toolset we've built to create and manage this large ETL operation.
Municipalities are the branch of the government that probably affects us the most (think education, garbage collection, building permits...). They are also notorious for not being transparent, making it difficult for us citizens to make sure that the people in charge are making good use of our taxes and that our city is performing well in comparison to others.
In the beginning of 2019 we (at Hasadna) embarked on a project to make municipalities more transparent - DataCity. In this project we aim to create a single API endpoint for all municipalities' data (normalized, standardized, verified, regularly-updated).
There are a few problems along the way, though:
1. They don't really want to be transparent.
2. Data is of low quality and very non-uniform.
To solve (2), we're building a versatile framework for:
- extracting data from various sources and formats,
- cleaning it,
- mapping it to a predefined schema,
- validating it with domain-specific rules,
- enriching it and finally
- publishing it in our data warehouse
We're doing all that in a reusable way, based on open source tools (mainly the dataflows ETL library, which I'll describe in detail during the talk).
(we also have a solution for (1) - I'll talk about that too :) )
A practical guide for defining effective interfaces between Python applications and Python-based Deep Learning algorithms
Update: Slides available at http://bit.ly/317Dpqe
One rarely discussed merit of the Deep Learning revolution is the dramatic change in how we build software; the old days of a developer re-implementing the algorithmic sketch are long gone, as frameworks like TensorFlow reduce the distance from a theoretical paper to production to nearly zero.
But is it without a price? What happens when the requirements change? How can new features be added without having to train a new model? How can we continuously improve our DL algorithms without requiring software changes?
In this talk I’ll present a practical interface which allows continuous development of both your Python application, and your Deep Learning models. We’ll walk through the different trade-offs, and how changes can be introduced on both sides, while keeping your system live in production.
Ansible is a popular tool for configuration management and automation. Let's learn how to use Python to extend Ansible with our own custom plugins and modules.
Ansible is a popular tool that lets one use a simple YAML-based syntax to chain together a large number of built-in modules in order to create automated system configuration processes for a wide variety of systems. Being a flexible tool, Ansible is seeing increasing use for other automation tasks such as building software and running automated tests.
Ansible includes a large number of built-in modules and plugins that allow for interacting with many different tools and services, but sometimes it is not enough. Luckily we can use Python to extend Ansible and add any functionality we might need.
In this talk we will learn about the different kinds of modules and plugins Ansible has and how to use Python to implement the most common kinds.
Let's take a tour of Rust's Python ecosystem, reviewing what Rust has to offer to Python and visiting the crates that will build your next Rust-based native Python extensions.
Rust is getting popular and was ranked the most loved language of 2019, for the 4th year in a row, promising safety AND performance. Agenda:
- Reminder - Python FFI
- Why Rust is Awesome?
- Rust Example
- Comparing Rust to C
- Packaging: Rust Crate -> Python Package in 60 Sec
This talk will go over what's new in Python 3.8, with emphasis on assignment expressions as well as the controversial discussions that led Guido to quit his position as BDFL.
No one expected Guido to quit as Python BDFL, not to mention over a simple, useful feature such as assignment expressions... In this talk I will discuss the new Python features and explain them with examples. In addition, the special case of assignment expressions will be reviewed, both as a feature and in terms of what happened during the PEP discussion and what eventually caused Guido to quit his position as BDFL. Last, a few words about the new era of Python with no BDFL.
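For readers who have not met assignment expressions yet, here is a short sketch of the `:=` operator (Python 3.8+). The sample data is invented for illustration.

```python
import re

lines = ["GET /index 200", "malformed", "POST /login 403"]
pattern = re.compile(r"(\w+) (\S+) (\d{3})")

statuses = []
for line in lines:
    # Bind the match and test it in one expression.
    if (match := pattern.fullmatch(line)) is not None:
        statuses.append(int(match.group(3)))

print(statuses)  # [200, 403]

# In a comprehension: compute len() once, then both filter on it and keep it.
long_words = [n for word in ["py", "python", "con"] if (n := len(word)) > 2]
print(long_words)  # [6, 3]
```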
A new debugging solution for Python that became a huge hit overnight
I had an idea for a debugging solution for Python that doesn't require complicated configuration like PyCharm. I released PySnooper as a cute little open-source project that does that, and to my surprise, it became a huge hit overnight, hitting the top of Hacker News, r/python and GitHub trending. In this talk I'll go into:
- How PySnooper can help you debug your code.
- How you can write your own debugging / code intelligence tools.
- How to make your open-source project go viral.
- How to use PuDB, another debugging solution, to find bugs in your code.
- A PEP idea for making debuggers easier to debug.
This session might give you what you need to assess whether the latest async Python APIs are right for your environment. We’ll look at challenges, solutions, considerations and most importantly examples.
In this hands-on talk, Ronnie Sheer (software engineer at BlueVine) describes the real-world challenges of integrating asynchronous code into robust Python applications. We’ll look at APIs using async/await and what can be learned from them. We’ll also walk through a small real-time web API built with Quart, a new tiny async (Flask-like) microframework. If you have been wondering if it’s time to incorporate async/await into your project, this might be a talk for you.
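As a point of reference for the async/await syntax discussed above, here is a minimal sketch using only the standard-library `asyncio` (it does not use Quart). The `fetch` coroutine is a made-up stand-in for a non-blocking network call.

```python
import asyncio


async def fetch(name, delay):
    await asyncio.sleep(delay)  # yields control instead of blocking the thread
    return f"{name} done"


async def main():
    # Both coroutines run concurrently; gather() returns results in
    # argument order, regardless of which one finishes first.
    return await asyncio.gather(fetch("a", 0.02), fetch("b", 0.01))


results = asyncio.run(main())
print(results)  # ['a done', 'b done']
```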
Challenges and architectural solutions for mass usage of async FaaS workers.
Demand for "Function as a Service" expertise is quickly growing among Python developers. This talk is addressed to the architects and developers working with anywhere from 1000 to 1 million+ daily invocations of Functions. One of the hardest challenges while using FaaS-based workers is orchestration:
- Chunk business jobs to thousands (or millions) of tasks for workers
- How many workers shall we invoke at a specific moment?
- Track the status of each invocation
- Retry / alert failed invocations
- Monitor collateral damage and side-effects (DB, queues, etc)
- Build dependencies of workers and chain workflows
In this talk we are going to discover even more challenges and discuss how to solve them using Python in the AWS cloud.
In the last part of the talk there is an introduction of the sosw orchestration package with a fast-forwarded demo. The package is now Open Source and open for contributions.
We assume general familiarity of the attendees with FaaS concepts.
Kubernetes/OpenShift is a portable, extensible open-source platform for managing containerized workloads and services. In this session, you will learn how to extend Kubernetes/OpenShift with Python to make you a cup of coffee.
Who doesn't like coffee? What if you could use your Python coding skills to extend Kubernetes/OpenShift to watch for possibly long-running processes and make your waiting time pleasant with a cup of coffee?
We will talk about what Kubernetes is, what a Kubernetes controller is, and how to write and deploy your own custom 100% Pythonic Kubernetes controller that uses the K8S Python client to listen to different events in K8S on one side, and communicates with your coffee machine on the other.
P.S: Making tea is not in the scope of this presentation.
How did we port our code from Python 2 to 3 in less than two months, with one dedicated developer and development still ongoing? Find out and learn why SAS and microservices are your friends.
At the verge of 2019 our code base was running on our customer's site achieving tremendous value, while at the same time under extensive development, continuously growing and written in Python 2. Our task was to port our code to Python 3 while maintaining our non-stop development and doing so in a tight timeline, with minimum development hours and customer impact. What do you need to know to complete a successful and fast transition? We will share with you our strategies, plans and unexpected issues. We’ll talk about how Agile, SAS, CI/CD and microservices play very nicely with the porting process. We will answer important questions such as how the team can keep developing during this process, how to estimate time and effort, and how to expect the unexpected, as well as how to minimize impact on customers. You will be interested to know what was the most painful, what our pitfalls were, and what was surprisingly easy. At the end of this talk, you will be equipped for your journey towards the future.
How easy is it to create tools with some Python and Bluetooth devices? If you never want your office mates to mess with your PC in the name of security, this talk is for you.
In this talk we are going to learn how to implement a Bluetooth distance detection device. This device was invented to solve a real life problem which will be discussed. With the Python tips and tricks from this session, you might be able to improve your office dynamics. Furthermore, with a little creativity you can prevent your cat from running away again. Instead of inventing problems for our tools, let’s solve real ones.
In this talk I'll present Python variable scopes: global, nonlocal, closures, the LEGB rule, and some known and lesser-known gotchas of Python scopes! Attending this talk might result in a small headache ;)
A brief recap of all known Python scopes, with examples:
- local variables
- nonlocal keyword
- global keyword
- closures
- LEGB rule
- generators/coroutines implications
- gotchas and unicorns
- dis module (needed for the next part)
I'll walk through real-world buggy code snippets, showing dis usage on the code examples and how to understand what's going on. A recap of how CPython implements each scope will be given.
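The snippets from the talk are not included in the abstract. The sketch below shows, under that caveat, the kind of thing the list above covers: a closure using `nonlocal`, the classic late-binding gotcha, and a peek at the bytecode with `dis`.

```python
import dis


def make_counter():
    count = 0  # lives in the enclosing (E in LEGB) scope

    def increment():
        nonlocal count  # rebind the enclosing variable, not a new local
        count += 1
        return count

    return increment


counter = make_counter()
counter()
print(counter())  # 2

# Gotcha: closures capture variables, not values, so every lambda sees
# the final value of i.
late = [callback() for callback in [lambda: i for i in range(3)]]
print(late)  # [2, 2, 2]

# Common workaround: freeze the current value as a default argument.
fixed = [callback() for callback in [lambda i=i: i for i in range(3)]]
print(fixed)  # [0, 1, 2]

# dis shows how CPython accesses the closed-over variable: through cell
# instructions such as STORE_DEREF rather than STORE_FAST.
opnames = {instruction.opname for instruction in dis.get_instructions(counter)}
print("STORE_DEREF" in opnames)  # True
```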
Fact checking using Python with Jupyter notebooks
The trust in statements made by politicians and other public figures is at an all-time low (Economist, 2017). This may be due to a real increase in the number of false claims made (or “alternative facts” cited), or a side-effect of the low affinity to the truth displayed by the current White House incumbent (and other leaders around the globe). At the same time, the polarization of the media undermines its credibility as an unbiased fact-checker (The Hill, 2017; American Press Institute, 2017). The good news, on the other hand, is the unprecedented ability of every one of us to independently fact-check claims we are interested in. With a plethora of raw data freely available on the internet and accessible tools for data analysis, all we need is to know how to ask the right questions, how to translate these questions into executable programs, and how to implement these programs and interpret their results.
In the age of blunt lies and accessible information, we believe that such skills should be acquired as early as the high-school years. We posit that the way to develop these skills is through hands-on experience, and that the way to motivate students to gain this experience is to let them ask and answer the questions that interest them. With this in mind, we developed an interactive textbook (Biron & Levine, 2019) written completely as a collection of Jupyter Notebooks. The book teaches fundamentals of data science through real-life fact-checking tasks, defined by the reader according to their own personal interests. Our goal is to teach three types of skills: how to design a fact-checking plan, how to retrieve the appropriate data, and how to analyze them. The Jupyter platform allows us to let the reader choose the questions they address and the data sets they use to engage in active learning, guaranteeing that the book is relevant and exciting to everyone, whatever their interests may be.
In this talk I’ll describe our mission to turn students, young and old, into budding data scientists, equipped with the methodology, design patterns, and Python, the language of data science. I will also discuss how platforms like Jupyter and Colaboratory provide us with tools to revolutionize our teaching and help us adapt our methods to the students of Gen Z.
As a full-stack developer in an Algo-trading / HFT world, I’m being asked very often: “Yeah but, python is too slow for that, nope?” During my talk, I will demonstrate how I deal with performance / high throughput problems using PyPy, including exam
Python is amazing. It gives us the ability to code various applications quickly, using a variety of packages. However, it is relatively slow compared to other programming languages performing the same tasks.
As a full-stack engineer in an Algo-trading HFT company, I often need to create super fast applications with very short-term deadlines. Python is my preferred choice because I can deliver fast. However, how can I overcome the speed barrier? That’s why I decided to take advantage of PyPy, a fast, compliant alternative implementation of the Python language.
In this talk, I will present some examples of applications in the field of Algo-trading which need to process data lightning fast, and I will focus on how I optimized my solutions using Python and PyPy.
The examples will be taken from the finance/trading world, including how I built and improved a FIX-protocol parser, and how I managed to consume a data feed of all exchanges quotes (Bid / Ask) without having gaps in communication nor dissection.
I will explain what were my major obstacles, how I managed to observe the bottlenecks in the system, monitor them, and finally how I managed to improve them.
The optimizations shown in my talk take place in the JIT world of PyPy, including ways to write your code for PyPy’s JIT. I will also show some tweaks I have made to make the Garbage Collector (GC) work much better in my environment. In addition, we will see how choosing the right data structure for our needs improves the overall latency of the system.
In this talk I will present the architecture of our simulation infrastructure, written in Python, which allows us to simulate hours of real life in only minutes of simulation. I will describe challenges and issues we encountered and how we handled them.
Our company's product uses a fleet of real (not virtual) robots to perform different tasks in a fulfillment warehouse. Simulations are very important to us: they allow us to test our solution and new features and to run regression tests without the need for real and expensive hardware, to measure and analyze the impact of different algorithms and optimizations, and to explore possible solutions before deploying them in production. Tasks performed by physical robots take time (movement over the warehouse, box lifting, etc.), but in simulation, where virtual robots are used, there is no need to wait all that time. Shortening simulation time improves the development process by providing faster feedback to developers and quicker CI and testing cycles. Another benefit is a more deterministic simulation: with this approach, each component (thread) in the system gets an equal opportunity (CPU time) in each time tick, unaffected by the underlying machine or operating system that the simulation is running on. Also, it is possible to simulate any hour of the day easily, so we wouldn't have to panic ahead of something like the "Y2K bug". I will also show how we extended SimPy to adapt this architecture when making the transition to microservices.
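The idea of skipping the waiting time can be illustrated with a tiny discrete-event loop. This is a generic sketch built on the standard library, not the speaker's architecture and not SimPy's API; `VirtualClock` and `robot_trip` are hypothetical names.

```python
import heapq


class VirtualClock:
    """Jump straight to the next scheduled event instead of sleeping."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._sequence = 0  # tie-breaker: same-time events stay in FIFO order

    def schedule(self, delay, action):
        heapq.heappush(self._queue, (self.now + delay, self._sequence, action))
        self._sequence += 1

    def run(self):
        while self._queue:
            # Advance simulated time to the earliest pending event.
            self.now, _, action = heapq.heappop(self._queue)
            action()


clock = VirtualClock()
arrivals = []


def robot_trip(name, travel_seconds):
    def arrive():
        arrivals.append((clock.now, name))
    clock.schedule(travel_seconds, arrive)


robot_trip("robot-1", 3600)  # an hour of simulated travel
robot_trip("robot-2", 1800)
clock.run()                  # finishes immediately in wall-clock terms

print(arrivals)  # [(1800.0, 'robot-2'), (3600.0, 'robot-1')]
```

Because the event order depends only on the scheduled times, repeated runs give the same result regardless of how fast the host machine is.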
Pylint is a widely used and scalable Python static code analysis tool. In this lecture we will learn how to configure it, extend it and run it as part of your CI.
Pylint
Pylint is a widely used and scalable Python static code analysis tool which looks for programming errors, helps enforce a coding standard, and offers simple refactoring suggestions.
Background
Pylint is highly configurable and has many parameters that allow you to match your linter to your exact coding standards. It's also highly extensible and allows you to define your own special checkers (code analyzers). Pylint can be run as part of your CI to enforce quality standards.
Lecture
Agenda:
1. Pylint: what it is and how it works.
2. AST (Python abstract syntax tree): what it is and how we use it for our checkers.
3. Pylint configuration: how we configure Pylint for our own use cases.
4. Pylint checkers: how we create our own static code analyzers:
  - Create raw checkers, which analyse each module as a raw file stream.
  - Create token checkers, which analyse a file using the list of tokens that represent the source code in the file.
  - Create AST checkers, which work on an AST representation of the module.
5. How to integrate Pylint in your CI.
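To show what "working on an AST representation" means in practice, here is a sketch that uses only the standard-library `ast` module. It deliberately does not use Pylint's checker classes, which have their own interface; the class name `BareExceptChecker` is made up.

```python
import ast


class BareExceptChecker(ast.NodeVisitor):
    """Collect the line numbers of bare `except:` clauses."""

    def __init__(self):
        self.lines = []

    def visit_ExceptHandler(self, node):
        if node.type is None:  # no exception class given
            self.lines.append(node.lineno)
        self.generic_visit(node)  # keep walking nested statements


source = '''
try:
    risky()
except:
    pass

try:
    risky()
except ValueError:
    pass
'''

checker = BareExceptChecker()
checker.visit(ast.parse(source))  # the source is parsed, never executed
print(checker.lines)  # [4]
```

An AST checker in a linter follows the same visitor pattern: one method per node type, reporting a message instead of appending to a list.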
Pandas is easy and fun to use, so much so that even a Python newbie can use it. I tell the story of how, as a new Python developer, I quickly learned enough Pandas to be able to test some hypotheses about financial markets.
During this talk, I tell the story of how, as a new Python developer, I quickly learned enough Pandas to be able to test some hypotheses about financial markets. I will demonstrate how I formulated the hypothesis, developed a test plan, gathered data, and tested the hypothesis. Along the way, attendees will see common Pandas techniques for cleaning, converting and reshaping the data, generating new derived data, finding patterns / correlations, and visualizing data. After attending this talk, you may choose to use Pandas to test your own hypotheses for fun (and maybe profit).
The audience for this talk is any level Python developer with little to no experience with Pandas. Through this talk, attendees will learn what Pandas can do, and how it simplifies data analysis. Attendees will also learn how to apply a number of basic Pandas techniques for cleaning data, slicing data, setting and using indices, calculating new columns, merging data sets, generating statistics, and visualizing results.
Outline:
- Introduction
- The Hypothesis
  - Stock ownership and Earnings
  - The Price/Earnings ratio and what it means to analysts
  - Hypothesis: Low P/E represents a “bargain” -- How can we test this?
- Pandas Basics
  - Series, DataFrame
  - Indexing, loc, iloc, slice
  - Utilities: read_csv, describe, apply, select, sort, transpose, merge, handling missing data
- Testing Our Hypothesis: Preparing the Data
  - Introduction to the data
  - Cleaning the header; setting the indices
  - Converting Price, P/E, and Dividends
  - Converting Dates
  - Calculating the average and standard deviation of P/E values
  - Merging Dividend Data with the Price and P/E data
- Testing Our Hypothesis: Calculations and Results
  - Calculating Total Return (for various periods) and generating new columns
  - P/E and Total Return Relationship
- Conclusions
  - Pandas is fun and easy, even for a newbie.
  - Any data can be analyzed; the art is in gaining an understanding of the data and being creative in your approach to getting valuable information out of the data.
Managing Python environment dependencies across an organization and between environments is always a pain. Pipenv makes it a little less painful by managing both the environment itself and its configuration.
'ImportError: No module named X'. Sound familiar? Probably all too familiar for most of you. Nearly every time you share pieces of code or send programs for remote execution, you will encounter this message. After all, we're human and we don't always remember to update the requirements.txt file when installing a new package. So code may work perfectly on one machine but fail on a different one.
Pipenv to the rescue! With Pipenv, now you can manage your environment and the configuration using the same command. Each time you install a package using Pipenv, a configuration file is updated automatically, and it can then be used to set an environment on a different machine in a reliable manner.
In this talk we will introduce you to Pipenv, the problems it can help you solve, how to use it properly, and caveats from our personal experience.
Implementing your own filesystem used to be difficult, requiring in-depth kernel knowledge. For this reason FUSE (Filesystem in Userspace) was introduced. We will discuss how we can implement our own filesystem in Python using the fuse Python module.
In this talk, we will give a very short and basic overview of FUSE, how it works, and what libfuse is. Then we will discuss the libfuse Python wrapper (the fuse module) and see how we can use it to implement our own little filesystem.
Our web is a world of silos, we need to decentralize that! Let's try to make it simple, and without the drawbacks normally associated with blockchain dApps. In python, of course.
I started working on decentralized application projects and didn't really find my way as a Pythonista there. Plus, blockchains are slow, clunky... What if we used AsyncIO and the latest Python stuff to upgrade this?
I'll present Aleph.im, a layer-2 network (complicated words for something simple!) made in Python, and how to make a simple decentralized app using it.
Features of applications made with Aleph.im:
- Elliptic curve identity and message signing
- Free and instant actions
- No server needed, the network does that
- Large data storage (media, text content, key-value stores)
- Virtual machines
Most people would not consider the language written within legal documents to be "natural" to any human being. We will demonstrate how their structure can prove handy for several NLP techniques.
Legal documents are known to be obscure, repetitive and overly complicated. In this talk we will cover our work at BestPractix, and demonstrate how the structure of legal contracts can be used for various tasks. We will cover: (0) Data annotation hacks (1) Fast clause and contract classification (2) Identifying defined terms (3) Applying deep learning to learn style.
- talk slides
This presentation will review the strengths and weaknesses of using pre-trained word embeddings, and demonstrate how to incorporate more complex semantic representation schemes such as Semantic Role Labeling, AMR and SDP into your applications.
Since the advent of word2vec, word embeddings have become a go-to method for encapsulating distributional semantics in NLP applications. This presentation will review the strengths and weaknesses of using pre-trained word embeddings, and demonstrate how to incorporate more complex semantic representation schemes such as Semantic Role Labeling, Abstract Meaning Representation and Semantic Dependency Parsing into your applications.
- talk slides
The centralized database that holds clinical trial data is in need of standardization. Python tools are used to help this effort.
This is joint work with Joshua Schertz
ClinicalTrials.Gov is the database where clinical trial data from all over the world is registered. Today some clinical trials are required by U.S. law to report their findings in this database. The database holds over 300,000 clinical trials, of which over 10% have numeric results. However, since many entities enter data into this fast-growing database, the data is not standardized. Specifically, numerical data cannot be comprehended since the units are not standardized. Over 23K different units were detected in this database in 2019, and many of those units are the same unit written differently. This talk will discuss how we use Python tools to 1) process and index the data, 2) find similar units using NLP and machine learning, 3) create a web site to support user mapping of those units. We created ClinicalUnitMapping.com to support the standardization effort for those units. New elements of this presentation will discuss how units from existing medical standards such as UCUM, RTMMS, and CDISC are incorporated in the Python processing pipeline. The intention is to create a unit standard that will be able to map all units reported by clinical trials. With such a standard, the data in this clinical trials database would become machine comprehensible.
How to create QGIS Python processing tools to improve sharing and integration. Using the power of QGIS to work on location data and use Python power to share the data in various ways and to integrate the data into other systems.
In many companies and organizations, location data is part of the day-to-day work and part of the important data that the organization uses. Location data comes in many different forms and from many sources. It can be GPS data, address data, a list of coordinates, etc. It can also come in various formats: GPX, SHP, KML, GeoJSON, etc. The best way to work on or analyze this data is to use a GIS application, and my suggestion is QGIS. QGIS is a free and open-source cross-platform desktop geographic information system (GIS) application that supports viewing, editing, and analysis of geospatial data.
To improve your work in QGIS you can write scripts that run in QGIS as Python processing tools. In my presentation, I will show how to write scripts that integrate location data from different sources into QGIS, and how to share your location data from QGIS with other workspaces.
For example, I will show how to use the Google Sheets API as a data source and how to create HTML pages from QGIS to share your data, all with Python scripts. I will use some Python packages like Pandas, Requests, etc.
Model explainability: why should we care? How to explain ML models? SHAP (SHapley Additive exPlanations).
ML-based models are becoming prevalent and affect more and more aspects of our lives. As we rely on models to approve loans, decide whether to hire someone or receive medical treatment, we need to understand why and how the model generates its predictions. What is model explainability? How can we interpret models?
SHAP library:
- How to install
- How to generate a global explanation
- How to generate a local explanation
- Visualizations
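To make the idea of a local explanation concrete, here is a minimal sketch that computes exact Shapley values for a toy model in plain Python. It does not use the SHAP library or its API. The model `toy_model`, the all-zero baseline, and the helper `shapley_values` are hypothetical names introduced for illustration, and "removing" a feature is approximated by setting it to its baseline value, which is an assumption.

```python
from itertools import permutations
from math import factorial


def toy_model(features):
    """Hypothetical scoring model with one interaction term."""
    income, debt, years = features
    return 2.0 * income - 3.0 * debt + 0.5 * income * years


def shapley_values(model, instance, baseline):
    """Exact Shapley values for one instance.

    Averages each feature's marginal contribution over every order in
    which features can be switched from the baseline to their real value.
    The cost grows factorially, so this only suits toy inputs.
    """
    n = len(instance)
    totals = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous_output = model(current)
        for index in order:
            current[index] = instance[index]
            output = model(current)
            totals[index] += output - previous_output
            previous_output = output
    return [total / factorial(n) for total in totals]


instance = [4.0, 1.0, 2.0]   # income, debt, years (made-up values)
baseline = [0.0, 0.0, 0.0]   # assumed reference point
local_explanation = shapley_values(toy_model, instance, baseline)
```

The per-feature values add up to the difference between the model output for the instance and for the baseline. A rough global explanation could be built by averaging the absolute values of such local explanations over many instances.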
NumPy is used as a data container and its APIs are well-defined. This talk covers things that have changed lately that you might not know about, and what you need to know about changes that are coming.
For the past year I have been one of three full-time NumPy contributors paid for by a non-profit organization. We have defined a roadmap, added new interfaces to the package, tightened up some of the APIs, improved performance, and merged a major refactor of the random module.
Why is this important, what else is yet to come, and how will this affect you, the end user?
Data scientists spend over 60% of their time doing feature engineering. I will discuss automation of the feature engineering process, using Featuretools, in order to significantly reduce time investment, make it repeatable, more robust and creative.
Data scientists spend over 60% of their time getting familiar with data, understanding features and the relationships between them, and ultimately creating new features from the data. This process is called feature engineering. It is a fundamental step before using predictive models and directly affects the predictive power of a model.
Traditional feature engineering is often described as an art: it requires both domain knowledge and data manipulation skills. The process is problem-dependent and might be biased by personal skills, loss of patience during data analysis, and many other factors which depend on the personality of the data scientist and their prior experience in the field. Recently it has been proposed that making feature engineering an automated process would significantly reduce the time invested in this crucial early step of modeling. In addition, it would be repeatable, more robust and creative.
Featuretools is an open source automated feature engineering library that was created by the developers at Feature Labs. In my talk I will present the Featuretools library, its concepts and functions. I will address a very important question: to what extent can feature engineering be completely automated? I will discuss different scenarios, presenting pros and cons. Finally, we will implement automated feature engineering and explore code examples.
Social network analysis is the study of social structures through the use of graph theory. In this talk I will present network theory and application of building and analyzing social networks for practical use-cases in Python with NetworkX.
Social network analysis is the process of investigating social structures through the use of networks and graph theory. It combines a variety of techniques for analyzing the structure of social networks as well as theories that aim at explaining the underlying dynamics and patterns observed in these structures. It is an inherently interdisciplinary field which originally emerged from the fields of social psychology, statistics and graph theory.
This talk will cover the theory of social network analysis, with a short introduction to graph theory and information spread. Then we will dive deep into Python code with NetworkX to get a better understanding of the network components, followed by constructing and inferring social networks from real Pandas and textual datasets. Finally we will go over code examples of practical use-cases such as visualization with matplotlib, social-centrality analysis and influence maximization for information spread.
Code examples for the sessions can be found here: https://github.com/dimgold/pycon_social_networkx
Video of the talk can be found here: https://www.youtube.com/watch?v=px7ff2_Jeqw
In this talk I will demonstrate how AI outperforms traditional triage measures for predicting early and late mortality after an emergency department (ED) visit, using EMR data of ~56K ED visits of adult patients over a period of 5 years.
Emergency departments (ED) are increasingly overwhelmed, which worsens the poor outcomes associated with ED overcrowding. Triage scores aim to optimize waiting time and prioritize resource usage according to the severity of the medical condition. However, the most widely used triage scores rely heavily on provider judgment, which can lead to inaccuracy and misclassification. Artificial intelligence (AI) algorithms offer advantages for creating predictive clinical applications because of their flexibility in handling large datasets from electronic medical records (EMR), and are becoming better at prediction tasks, often outperforming current clinical scoring systems.
In this talk I will present a collaborative study between Intuit’s AI-ML researchers and Shiba hospital innovation center. In this study we used EMR data to predict mortality at the early triage and ED visit level. The study included 559,353 ED visits of adult patients over five years (2012-2017). Variables included: demographics, admission date, arrival mode, referral code, chief complaint, previous ED visits, previous hospitalizations, background diagnoses, regular drugs, vital signs and ESI score. Using XGBoost, CatBoost and deep learning we obtained an AUC of 0.97 for early mortality, an AUC of 0.93 for short term mortality, and an AUC of 0.93 for long term mortality (90 days from admission). Single variable analysis shows that the two variables with the highest AUC were: age and arrival mode for early mortality; age and main cause for short term mortality; and age and number of drugs for long term mortality. The highest information gain variables for early mortality (calculated using XGBoost with 1,000 trees) were SBP, HR, days to most recent previous ED visit and fever. The results outperform e
This real-world model operationalization case study will highlight common mistakes and propose solutions, workarounds and practical tips for successfully deploying an ML model.
A Case study: How to effectively operationalize a Machine Learning model
A European shipping company was looking to gain a competitive advantage by leveraging Machine Learning techniques. The aim was to create shipping-lane specific demand forecasting, and to implement it throughout its operations, in order to: save time and manual labor, adjust pricing and business agreements, and utilize smart resource allocation. Each percentage of improvement is worth $1.5 million.
In order to effectively operationalize a Machine Learning model you need to cross 3 chasms: the first is Business Relevance - avoiding model development before thinking the business value through. A clear vision of the desired business impact must shape the approach to data sourcing and model building.
The second is Operationalizing Models - migrating a predictive model from a research environment into production. This process can be difficult because data scientists are typically not IT solution experts and vice versa.
The third, and most critical chasm is Translating predictions to business impact - where a data scientist ensures the decision makers understand the predictions and have enough wiggle room to take action and turn it into a competitive advantage. Management must possess the muscle to transform the organization so that the data and models actually yield better decisions. Additionally, model outputs need to be integrated into well-designed applications making them easy to consume.
In this talk, I will explain these three elements using a real-world case study. I will highlight common mistakes to avoid when operationalizing a Machine Learning model in an enterprise environment. I will present specific lessons learnt and practical tips from this real world project.
Programs which aim to eradicate disease must rely on interpretable models. These models quickly become hard to solve, not to mention train on missing parameters. SciPy and PyMC come to our rescue for the heavy lifting.
In 2018, Israel saw the biggest outbreak of measles since the introduction of a vaccine in the late 1960s. Nowadays, vaccine policies are not only decided by laboratory tests. Those tests are complemented by a plethora of computational epidemiology simulations predicting the effects of various vaccination policies on the entire population. A population-level policy to eradicate disease must rely on interpretable models. These models quickly become hard to solve, not to mention train on missing parameters. Using SciPy as a solver and PyMC for Bayesian inference, we are able to learn parameter distributions for missing natural parameters, such as the disease's "strength" or "infectiousness". We can then use the underlying distributions for these parameters in order to simulate possible outcomes for future policies.
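A compartmental SIR model is one common example of the kind of interpretable model described here. The sketch below is an illustration under stated assumptions, not the speakers' model: it integrates the SIR equations with a plain forward-Euler loop rather than a SciPy solver, it does no Bayesian inference, and the names `simulate_sir`, `beta` (infectiousness), `gamma` (recovery rate) and `vaccinated_fraction` as well as all numeric values are hypothetical.

```python
def simulate_sir(beta, gamma, vaccinated_fraction,
                 days=200, dt=0.1, initial_infected=1e-4):
    """Simulate population fractions in a basic SIR model.

    Returns (susceptible, infected, immune, peak_infected) at the end
    of the simulated period. Vaccinated people start in the immune group.
    """
    immune = vaccinated_fraction
    infected = initial_infected
    susceptible = 1.0 - immune - infected
    peak_infected = infected
    for _ in range(int(days / dt)):
        # Forward-Euler step of dS/dt = -beta*S*I and dI/dt = beta*S*I - gamma*I.
        new_infections = beta * susceptible * infected * dt
        new_recoveries = gamma * infected * dt
        susceptible -= new_infections
        infected += new_infections - new_recoveries
        immune += new_recoveries
        peak_infected = max(peak_infected, infected)
    return susceptible, infected, immune, peak_infected


# Compare two made-up vaccination policies for the same made-up disease.
low_coverage = simulate_sir(beta=1.5, gamma=0.1, vaccinated_fraction=0.50)
high_coverage = simulate_sir(beta=1.5, gamma=0.1, vaccinated_fraction=0.95)
```

With these illustrative parameters the low-coverage run produces a large outbreak, while in the high-coverage run the infected fraction never rises above its starting value. In a setting like the one described above, `beta` would be an unknown parameter whose distribution is learned from data rather than fixed by hand.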
Hierarchical Temporal Memory is a novel framework for biological and machine intelligence. It learns patterns from relatively little data and is well suited for prediction, anomaly detection, classification and ultimately sensorimotor applications.
Hierarchical temporal memory (HTM) is a biologically constrained theory (or model) of intelligence, originally described in the 2004 book On Intelligence by Jeff Hawkins with Sandra Blakeslee. HTM is based on neuroscience and the physiology and interaction of neurons in the neocortex of the mammalian (in particular, human) brain. At the core of HTM are learning algorithms that can store, learn, infer and recall high-order sequences. Unlike most other machine learning methods, HTM learns (in an unsupervised fashion) time-based patterns in unlabelled data on a continuous basis. HTM is robust to noise, and it has high capacity, meaning that it can learn multiple patterns simultaneously. When applied to computers, HTM is well suited for prediction, anomaly detection, classification and ultimately sensorimotor applications.
In this talk I will cover some basics of the mammalian neocortex, provide an introduction to Hierarchical Temporal Memory (HTM), briefly compare it to Deep Neural Nets (DNN) and show some examples of HTM in action.
Using Dask in an ETL pipeline has some gotchas. Although there are many similarities to pandas, there are some issues and best practices to be aware of that can optimize the usage of Dask in general.
The presentation agenda:
- Intro to Dask framework
- Basic setup Client
- Dask.dataframe
- Data manipulation
- Read/Write files
- Advanced groupby
- Debugging
There is a jupyter notebook (see attachment) to supplement the talk. See also: jupyter notebook of the presentation (163.4 KB)
In this talk I will describe an end-to-end solution to a text classification problem using publicly available frameworks. I will focus on the practicalities of getting a Deep Learning-based text classification model up and running.
2018 has been declared by many as the "ImageNet moment" of NLP. Novel attention and transformer-based Neural Network (NN) architectures significantly improved state-of-the-art performance in many tasks. NLP-oriented transfer learning techniques claim to make text classification easy by adapting pre-trained models, trained on huge corpora, to proprietary datasets with only a very small number of labels. Models that previously required significant computational power over vast periods of time can now be trained in several hours on standard CPUs. But with all of these models and frameworks to choose from, how does one make sense of it all? Where to begin? In this talk I will describe an end-to-end solution to a text classification problem. I will demonstrate how to employ the available classification methods, evaluating their performance and also (arguably more importantly) their ease of use. I will highlight common pitfalls, explaining what it takes to get a Deep Learning-based text classification model up and running.
See also: Presentation slides (681.3 KB)
Data is Twiggle's bread and butter, so choosing the right data pipelining framework was critical for us. After comparing Luigi and Airflow pipelines we ended up selecting both! We’ll explain why, present our unique challenges and chosen solutions.
Organizing and scaling data pipelines is a common challenge many organizations face. Luigi and Airflow are the two most popular open source frameworks to help solve this task.
We will present a quick overview and comparison of the two. Then we will take a deep dive, including code examples, into the special cases for which we used the frameworks at Twiggle.
Among the examples we will discuss:
- Airflow as a highly available web server, and extending it with APIs for customers.
- Data processing using Dask and Spark in Luigi.
- Code reuse in Luigi vs Airflow.
code: https://github.com/orrshilon/pycon-israel-2019-airflow-luigi
slides: https://github.com/orrshilon/pycon-israel-2019-airflow-luigi/blob/master/PyCon%202019%20-%20Data%20Pipelines%20-%20Airflow%20vs.%20Luigi%20by%20people%20who%E2%80%99ve%20made%20mistakes%20in%20both.pdf
Discover Python's ast module to see how you can analyze and generate Python code.
Have you ever wondered how much of your code could be generated automatically? Introspection, mutation, extension - Python's uber-dynamic nature allows us to do all kinds of kinky stuff. At this workshop, we will take a look at the 'ast' module and see how it allows us to analyze and produce Python code. By leveraging its power, we will try to create some useful tools.
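As a small taste of both directions, the sketch below uses only the standard library `ast` module: a `NodeVisitor` analyzes source code and a `NodeTransformer` rewrites it before it is compiled. The sample source and the class names `FunctionCollector` and `MultiplyToAdd` are made up for illustration and are not taken from the workshop material.

```python
import ast

SOURCE = """
def area(width, height):
    return width * height

def greet(name):
    return "hello " + name
"""


class FunctionCollector(ast.NodeVisitor):
    """Analysis: record each function name with its argument names."""

    def __init__(self):
        self.functions = {}

    def visit_FunctionDef(self, node):
        self.functions[node.name] = [arg.arg for arg in node.args.args]
        self.generic_visit(node)


class MultiplyToAdd(ast.NodeTransformer):
    """Generation: rewrite every `a * b` into `a + b`."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Mult):
            node.op = ast.Add()
        return node


tree = ast.parse(SOURCE)

collector = FunctionCollector()
collector.visit(tree)

# Transform the tree, repair location info, then compile and run it.
new_tree = ast.fix_missing_locations(MultiplyToAdd().visit(tree))
namespace = {}
exec(compile(new_tree, filename="<ast>", mode="exec"), namespace)
```

After this runs, `collector.functions` maps each function name to its parameters, and `namespace["area"]` is the rewritten function that adds its arguments instead of multiplying them.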
Following the talk about Aleph.im in the previous days, we will create a device that pushes its data to a decentralized and verifiable (signed data) network.
Creating devices with Python has been possible for a few years now (I even did a workshop about this two years ago here), but what if you want to guarantee your data can't be tampered with, and don't want to trust a server either?
You will need the power of decentralized applications, elliptic curve cryptography and smart contracts. Wow, that sounds hard, plus it will cost money (tokens), bwah. Nope, not with Aleph.im: it's quite simple and we will do it together!
We will set up MicroPython on small devices and upload a program that sends sensor data from the devices.
The interesting part here is that we won't use standard storage servers but the API servers of a decentralized cloud. The Aleph.im network will take care of storing and processing our data, giving us and others guarantees of immutability (hashes stored on a few blockchains).
No prerequisites here, besides knowing Python.
Have you ever wondered about how those data scientists at Facebook and LinkedIn make friend recommendations? In this tutorial we'll work with networks using Python, and we will look at various real world applications of network science.
This tutorial will introduce the basics of network theory and working with graphs/networks using Python and the NetworkX package. This will be a hands-on tutorial and will require writing a lot of code snippets. The participants should be comfortable with basic Python (loops, dictionaries, lists) and have some (minimal) experience with working inside a Jupyter notebook.
The tutorial will follow the given outline:
Part 1: Introduction (30 min)
- Networks of all kinds: biological, transportation.
- Representation of networks, NetworkX data structures
- Basic quick-and-dirty visualizations
Part 2: Hubs and Paths (40 min)
- Finding important nodes; applications
- Pathfinding algorithms and their applications
- Hands-on: implementing path-finding algorithms
- Visualize degree and betweenness centrality distributions.
Part 3: Game of Thrones network (1 hour)
- Construct the Game of Thrones co-character network
- Analyze the network to find important characters and communities in the network.
Part 4: US Airport Network (1 hour)
- Construct the US Airport Network using data from the last 25 years
- Analyze the temporal evolution of importance of airports in the network
- Optimizing the network
Part 5: Advanced Network Science Concepts (20 mins)
- Quick introduction to machine learning on networks
- Parallel ideas between linear algebra and consensus in networks
By the end of the tutorial everyone should be comfortable with hacking on the NetworkX API, modelling data as networks and analysis on networks using python.
The tutorial and the datasets are available on GitHub at: https://github.com/mriduls/pydata-networkx
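The hands-on path-finding and centrality steps in the outline can be sketched without any library. This is a minimal illustration in plain Python, not the tutorial's notebook code and not the NetworkX API; the tiny friendship graph `FRIENDS` and the function names are made up.

```python
from collections import deque

# Undirected friendship graph as an adjacency list (made-up data).
FRIENDS = {
    "ana": ["ben", "cal"],
    "ben": ["ana", "dov"],
    "cal": ["ana", "dov"],
    "dov": ["ben", "cal", "eli"],
    "eli": ["dov"],
}


def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    others = len(graph) - 1
    return {node: len(neighbors) / others for node, neighbors in graph.items()}


path = shortest_path(FRIENDS, "ana", "eli")
centrality = degree_centrality(FRIENDS)
most_connected = max(centrality, key=centrality.get)
```

Breadth-first search finds a shortest path only when every edge counts the same; weighted networks such as an airport network with distances would need an algorithm like Dijkstra's instead.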
One of the best ways to understand how programming languages work (including Python) is to implement one.
Greenspun's tenth rule states that "Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp." Understanding how programming languages work will make you a better programmer and give you a better understanding of Python itself.
We'll implement a small Lisp-like language and discuss language design and implementation issues and how they show up in Python:
- Lexing & Parsing: What are the implications of Python using whitespace for indentation?
- Variable scope & closures: Why we have global and nonlocal in Python
- Types: Why the value of 1/2 changed from Python 2 to 3
- Evaluating code: Python's eval vs exec and the bytecode interpreter. Why do or and and short-circuit?
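To give a feel for how small such an interpreter can be, here is a minimal sketch of a Lisp-like evaluator covering lexing, parsing, closures and lazy branch evaluation. It is an illustration only and not the workshop's implementation; the supported forms (`if`, `define`, `lambda`) and all function names are assumptions made for this example.

```python
import operator


def tokenize(source):
    """Lexing: pad parentheses with spaces, then split on whitespace."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parsing: consume tokens and build nested Python lists."""
    token = tokens.pop(0)
    if token == "(":
        expression = []
        while tokens[0] != ")":
            expression.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return expression
    try:
        return int(token)
    except ValueError:
        return token  # anything that is not an integer is a symbol


GLOBAL_ENV = {"+": operator.add, "-": operator.sub,
              "*": operator.mul, "<": operator.lt}


def evaluate(expression, env=GLOBAL_ENV):
    if isinstance(expression, str):   # symbol lookup
        return env[expression]
    if isinstance(expression, int):   # literal
        return expression
    head, *rest = expression
    if head == "if":                  # only the chosen branch is evaluated
        test, then_branch, else_branch = rest
        chosen = then_branch if evaluate(test, env) else else_branch
        return evaluate(chosen, env)
    if head == "define":
        name, value = rest
        env[name] = evaluate(value, env)
        return None
    if head == "lambda":              # closure over the defining environment
        params, body = rest
        return lambda *args: evaluate(body, {**env, **dict(zip(params, args))})
    function = evaluate(head, env)
    return function(*[evaluate(argument, env) for argument in rest])


def run(source):
    return evaluate(parse(tokenize(source)))


run("(define fact (lambda (n) (if (< n 2) 1 (* n (fact (- n 1))))))")
result = run("(fact 5)")
```

Because `if` is a special form that evaluates only one branch, `(if (< 1 2) 10 (missing))` returns 10 without ever looking up `missing`. Python's `or` and `and` skip their right-hand operand in the same way.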
Python is known for its ability to interoperate with other languages. We will learn how to call C from Python using CFFI, Ctypes, Cython, and others.
The workshop is based on the Jupyter notebook available here. You will want to make sure you have an appropriate Python 3 environment ready, including cppyy, which can be installed via pip.
Starting from a simple pure-Python program that produces a Mandelbrot fractal image, we will learn how to:
- Benchmark and profile python performance
- Write the time-critical code in C
- Call the C code from Python using CFFI, Ctypes, Cython, and, if time allows, CPPYY and PyBind11.
- Compare and contrast the methods: ease of use, maintainability, speed and popularity
At the end we will discuss other solutions for making the pure-python version fast.
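For orientation, a pure-Python starting point of the kind described above might look like the sketch below, together with a simple `timeit` measurement. This is an assumption for illustration, not the code from the workshop notebook; the function names, grid size and iteration limit are made up, and the escape-count loop is the part one would typically move to C.

```python
import timeit


def escape_count(c, max_iterations=50):
    """Number of iterations of z -> z*z + c before |z| exceeds 2.

    Returns max_iterations if the point never escapes within the limit.
    """
    z = 0j
    for iteration in range(max_iterations):
        z = z * z + c
        if abs(z) > 2.0:
            return iteration
    return max_iterations


def mandelbrot_grid(width, height, max_iterations=50):
    """Escape counts for a width x height grid over [-2, 1] x [-1.5, 1.5]."""
    rows = []
    for row in range(height):
        imaginary = -1.5 + 3.0 * row / (height - 1)
        rows.append([
            escape_count(
                complex(-2.0 + 3.0 * column / (width - 1), imaginary),
                max_iterations,
            )
            for column in range(width)
        ])
    return rows


grid = mandelbrot_grid(31, 31)

# Benchmark: total seconds for three full renders of the small grid.
seconds = timeit.timeit(lambda: mandelbrot_grid(31, 31), number=3)
```

The nested loop over pixels and iterations dominates the run time. That is why moving only `escape_count` (or the whole grid loop) to C usually captures most of the available speedup.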
Named-entity recognition and document classification are by far the most common NLP tasks. This workshop will focus on named-entity recognition (NER) with deep learning methods.
Natural language processing is an umbrella term for several tasks; common ones include document classification, machine translation, and named-entity recognition. Deep learning methods have revolutionized the NLP field, breaking state-of-the-art benchmarks in all of these areas. This workshop will focus on named-entity recognition (NER) with deep learning methods. The workshop is hands-on, meaning that participants are required to bring their own laptop with all the requirements installed. Python proficiency is assumed, and a machine-learning background is a big plus.