Beyond KMeans - using LLMs to improve text clustering

Noah Santacruz

Language: English

The presentation was given on 2024-09-16 at the PyCon Israel 2024 conference.

Text clustering is a fundamental process in NLP, but what do you do when your clusters just aren’t right? I will share how I ended up combining sklearn and langchain to reduce duplicated clusters and shrink the catch-all “Misc” cluster.

Text clustering can be used to organize text, analyze data, extract topics, or segment customers by interests. We are also blessed with many cheap, high-quality text embedding APIs that should improve clustering. But what do you do when your clusters just aren’t right? Topics are duplicated, clustering is too sensitive to vocabulary, and there’s always that one giant “Misc” cluster.

As a newcomer to clustering, I gravitated towards KMeans since it seemed to be the default clustering algorithm. I will share the pitfalls I encountered, the clustering algorithms I explored, and how I incorporated LLMs into the process to achieve far better results.