Multistore

The success of NoSQL DBMSs has pushed the adoption of polyglot storage systems that take advantage of the best characteristics of different technologies and data models. While operational applications benefit greatly from this choice, analytical applications suffer from the lack of schema consistency, not only between different DBMSs but within a single NoSQL system as well.

In this context, the discipline of data science is steering analysts away from traditional data warehousing and towards a more flexible and lightweight approach to data analysis. The idea is to perform OLAP analyses in a pay-as-you-go manner across heterogeneous schemas and data models, where the integration is progressively carried out by the user as the available data is explored.

We propose an approach to support data analysis within a high-variety multistore, with heterogeneous schemas and overlapping records. Multistores are data management systems that enable query processing across different and heterogeneous databases; besides the distribution of data, complexity factors like schema heterogeneity and data replication must be resolved through integration and data fusion activities. The multistore solution that we have developed relies on a dataspace to provide the user with an integrated view of the available data.
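To make the integration and data fusion step concrete, here is a minimal sketch in plain Python, not the actual implementation. It assumes two hypothetical stores (one relational, one document-based) that describe the same customers under different attribute names. The mappings, attribute names, sample records, and the "first non-null value wins" fusion rule are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical mappings from each store's local attribute names
# to the unified names exposed by the dataspace (assumed for illustration).
MAPPINGS = {
    "relational": {"cust_id": "customer_id", "cust_name": "name", "city": "city"},
    "document": {"_id": "customer_id", "fullName": "name", "email": "email"},
}


def to_unified(store, record):
    """Rename a record's attributes according to the mapping of its store."""
    mapping = MAPPINGS[store]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def fuse(records, key="customer_id"):
    """Merge overlapping records that share the same key.

    Conflict resolution here is deliberately naive (an assumption):
    the first non-null value seen for an attribute is kept.
    """
    fused = defaultdict(dict)
    for rec in records:
        target = fused[rec[key]]
        for attr, value in rec.items():
            if value is not None and attr not in target:
                target[attr] = value
    return dict(fused)


# Toy data: customer 1 appears in both stores (overlapping records).
relational_rows = [{"cust_id": 1, "cust_name": "Ada", "city": "Bologna"}]
document_rows = [
    {"_id": 1, "fullName": "Ada L.", "email": "ada@example.org"},
    {"_id": 2, "fullName": "Bob", "email": None},
]

unified = [to_unified("relational", r) for r in relational_rows]
unified += [to_unified("document", r) for r in document_rows]
fused = fuse(unified)
```

In this sketch the record for customer 1 ends up with attributes coming from both stores, while the conflicting name is resolved in favor of the store that was read first. A real system would need a more principled conflict-resolution policy.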

Our approach supports relational, document, wide-column, and key-value data models by automatically handling both data model and schema heterogeneity through a dataspace layer on top of the underlying DBMSs. The expressiveness we enable corresponds to GPSJ queries, which are the most common class of queries in OLAP applications. We rely on Nested Relational Algebra to define a cross-database execution plan. Different strategies to carry out joins and data fusion are evaluated by means of a self-learning black-box cost model, which estimates execution times and selects the most efficient plan. The system has been prototyped on Apache Spark.
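As a reminder of what a GPSJ query looks like, the sketch below evaluates one in plain Python: a join, followed by a selection, followed by a generalized projection (group-by with aggregation). It is only an illustration of the query class, not of the prototype's execution plans on Apache Spark. The collections, attribute names, and the gpsj helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: orders and customers could live in different stores.
orders = [
    {"order_id": 1, "customer_id": 1, "amount": 30.0, "year": 2021},
    {"order_id": 2, "customer_id": 1, "amount": 20.0, "year": 2021},
    {"order_id": 3, "customer_id": 2, "amount": 50.0, "year": 2021},
    {"order_id": 4, "customer_id": 2, "amount": 10.0, "year": 2020},
]
customers = [
    {"customer_id": 1, "city": "Bologna"},
    {"customer_id": 2, "city": "Cesena"},
]


def gpsj(facts, dims, join_key, predicate, group_by, measure):
    """Evaluate a simple GPSJ query: join, then select, then group and SUM."""
    # Join: build a hash index on the smaller collection and probe it.
    dim_index = {d[join_key]: d for d in dims}
    joined = (
        {**f, **dim_index[f[join_key]]}
        for f in facts
        if f[join_key] in dim_index
    )
    # Selection: keep only the rows that satisfy the predicate.
    selected = (row for row in joined if predicate(row))
    # Generalized projection: group by one attribute and sum the measure.
    totals = defaultdict(float)
    for row in selected:
        totals[row[group_by]] += row[measure]
    return dict(totals)


# Total order amount per city for the year 2021.
result = gpsj(
    orders,
    customers,
    join_key="customer_id",
    predicate=lambda r: r["year"] == 2021,
    group_by="city",
    measure="amount",
)
print(result)
```

The hash join used here is just one possible join strategy. In the approach described above, choosing among alternative strategies is the task of the cost model.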

Downloads

  • Set of queries: link
  • Dataset (scale factor 1): link
  • Dataset (scale factor 10): link
  • Dataset (scale factor 100): link

Publications

Chiara Forresi, Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Cost-based optimization of multistore query plans. Information Systems Frontiers (2022)

Chiara Forresi, Matteo Francia, Enrico Gallinucci, Matteo Golfarelli: Optimizing Execution Plans in a Multistore. ADBIS 2021: 136-151

Chiara Forresi, Enrico Gallinucci, Matteo Golfarelli, Hamdi Ben Hamadou: A dataspace-based framework for OLAP analyses in a high-variety multistore. VLDB J. 30(6): 1017-1040 (2021)

Hamdi Ben Hamadou, Enrico Gallinucci, Matteo Golfarelli: Answering GPSJ Queries in a Polystore: A Dataspace-Based Approach. ER 2019: 189-203