Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests

Lülf, Christian; Lima Martins, Denis Mayr; Vaz Salles, Marcos Antonio; Zhou, Yongluan; Gieseke, Fabian

Research article in edited proceedings (conference) | Peer reviewed

Abstract

The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find "interesting" objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach requires a scan of all the data to apply the classification model to each instance in the data catalog, making this method prohibitively expensive to be employed in large-scale databases serving many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that allows users to interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can rapidly answer such queries at low cost without scanning the entire database. Our framework is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which in turn can be efficiently executed by leveraging multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds using a single server, compared to hours needed by classical scanning-based approaches.

Details about the publication

PublisherVLDB Endowment
Book titleProceedings of the VLDB Endowment (Volume 16)
Page range2845-2857
Publishing companyACM Press
Place of publicationVancouver
Edition11
StatusPublished
Release year2023
ConferenceVLDB 2023, Vancouver, Canada
DOI10.14778/3611479.3611492
Link to the full texthttps://dl.acm.org/doi/abs/10.14778/3611479.3611492
KeywordsMachine Learning; Decision Trees; Index Structures

Authors from the University of Münster

Gieseke, Fabian
Chair of Machine Learning and Data Engineering (Prof. Gieseke) (MLDE)
Lima Martins, Denis Mayr
Chair of Machine Learning and Data Engineering (Prof. Gieseke) (MLDE)
Lülf, Christian
Chair of Machine Learning and Data Engineering (Prof. Gieseke) (MLDE)