Don’t Underestimate the Obvious: Murphy’s Law in Real-life Data Science
Murphy’s Law states that if anything can go wrong, it will, and this is particularly true in data science. Drawing on personal experience, I describe how to build an effective model despite data pitfalls, methodological hazards and hidden bugs.
“If anything can go wrong, it will”, states Murphy’s Law, and this holds particularly true in data science. Whereas the algorithms used in data science are mathematically sound, the path to an effective model is often rife with obstacles, such as data deficiencies, methodological pitfalls, hidden bugs and human mistakes. In this talk, I share lessons on common obstacles and how to avoid them in your projects, drawn from nearly a decade of data science work. Topics include how data often violates our intuitive assumptions and how Python tools can detect such violations; how to build a model with confidence by starting from simple baselines and using synthetic data to your advantage; how to avoid common pitfalls (such as unintentional overfitting, train set contamination and inconsistent package versions) via defensive programming and code reviews; and how to protect long-term code correctness with pytest.
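To give a flavour of the defensive checks mentioned above, here is a minimal sketch, not material from the talk itself. It assumes each sample carries a unique identifier and that labels are plain Python values; the function names `check_no_leakage` and `majority_baseline_accuracy` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def check_no_leakage(train_ids, test_ids):
    """Raise ValueError if any identifier appears in both train and test sets."""
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        raise ValueError(
            f"{len(overlap)} id(s) appear in both train and test: {sorted(overlap)}"
        )


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)
```

A check like `check_no_leakage` can run at the top of a training script, and the same functions can be exercised from a pytest test module so that the guard keeps working as the code evolves. Any model worth keeping should beat the score returned by `majority_baseline_accuracy` on the same split.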