Large-Scale Kernel Density Estimates - Smoother is Better

Jeffrey Phillips, University of Utah, USA

Abstract:
A surprisingly common query in data analysis asks for the density of data around a query point. A typical approach is a range query that reports the number of points within a fixed radius. In this work we describe how to replace this fixed-radius count with a smooth one, and in doing so greatly improve the robustness and, not coincidentally, the efficiency.

The kernel density estimate KDE_P of a point set P is a function that estimates the density of the point set at any query point q by averaging the effect of a kernel (a similarity function, e.g. a Gaussian) between q and each point in P. In particular, we will describe how to take a massive point set P and reduce it to a much smaller set S (of size independent of the size of P) so that KDE_P is everywhere close to KDE_S.
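In symbols, with a Gaussian kernel of bandwidth sigma as one common choice (the notation below, including the error parameter eps, is illustrative and not fixed by the abstract):

    KDE_P(q) = (1/|P|) * sum_{p in P} K(p, q),    where e.g.  K(p, q) = exp(-||p - q||^2 / (2 sigma^2)),

and the goal is a summary S such that  max_q |KDE_P(q) - KDE_S(q)| <= eps  for a prescribed error eps.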
We show that this can be done much more efficiently for a kernel density estimate than for an estimate based on a fixed radius - both in theory and in practice on terabytes of data.
Moreover, this summary can be made mergeable - allowing it to be maintained and updated in a streaming fashion, or aggregated across multiple data sets, without sacrificing any error-versus-size guarantees.
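As a rough illustration (not the construction from the talk), the following Python sketch evaluates a Gaussian KDE on a large point set P and on a small summary S. Here S is simply a uniform random sample, whereas the talk describes how to build a much better summary with formal error-versus-size guarantees; the names and parameters below (kde, sigma, the sample size 500) are illustrative choices only.

    import numpy as np

    def kde(points, q, sigma=1.0):
        # Gaussian kernel density estimate of a point set, evaluated at query q.
        sq_dists = np.sum((points - q) ** 2, axis=1)
        return np.mean(np.exp(-sq_dists / (2 * sigma ** 2)))

    rng = np.random.default_rng(0)
    P = rng.normal(size=(100000, 2))                    # large point set
    S = P[rng.choice(len(P), size=500, replace=False)]  # small summary (naive choice: a random sample)

    q = np.array([0.5, -0.25])
    print(kde(P, q), kde(S, q))  # the two estimates should agree closely at any query q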

joint work with: Yan Zheng, Jeffrey Jestes, and Feifei Li

Biography:
Jeff Phillips is an Assistant Professor at the University of Utah. Before that, he was an NSF CI Postdoctoral Fellow, also at Utah. He earned his PhD at Duke University while on an NSF Graduate Research Fellowship and a James B. Duke Graduate Fellowship, and his BS in CS and BA in Math at Rice University. His research interests are in algorithms for big data analytics, an emerging area that spans the fields of algorithms, computational geometry, data mining, and databases.