HoloClean: Data Quality Management with Theodoros Rekatsinas
57 min

Many data sources produce new data points at a very high rate. At that volume, data quality becomes a problem: low-quality data can degrade the accuracy of machine learning models built on those sources. Ideally, every data source would be completely clean, but that is not realistic. One alternative is a data cleaning system, which cleans up the data after it has been generated.

HoloClean is a statistical inference engine that can impute, clean, and enrich data. HoloClean is centered around “The Probabilistic Unclean Database Model”, in which two components, an “intension” and a “realizer”, work together to fill in missing fields and fix erroneous fields in data.

HoloClean was created by Theo Rekatsinas, and he joins the show to talk about the problem of fast, unclean data, and his work with HoloClean. We also talk about other problems in machine learning and the engineering workflows around data.
