CubeLoad

CubeLoad is a parametric generator of OLAP workloads. Its main features are:

  • No predefined multidimensional schema is used. The benchmarker can create a workload for any multidimensional schema provided it has been exported in XML compliant with the Mondrian format.
  • The workload is generated in the form of sessions, each including a variable number of aggregate queries. The main parameters used are related to a realistic profile-based workload model.
  • Sessions are generated according to a set of four templates that model recurrent types of user analyses.
  • If an instance of the multidimensional schema is available (in particular, in the form of a set of dimension tables), its data are used for generating instance-dependent (hence, more realistic) workloads.
  • The generated workload is exported in XML to ensure maximum usability.
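As a rough illustration of the schema input mentioned in the first feature, the sketch below reads hierarchy levels and measures from a miniature schema. It is not CubeLoad code: the schema content, the `read_schema` helper, and the table and column names are all made up for the example. Only the element and attribute names (`Schema`, `Cube`, `Dimension`, `Hierarchy`, `Level`, `Measure`, `aggregator`) follow the Mondrian 3.x schema format.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature schema, invented for this example.
SCHEMA_XML = """
<Schema name="Demo">
  <Cube name="Sales">
    <Table name="sales_fact"/>
    <Dimension name="Store" foreignKey="store_id">
      <Hierarchy hasAll="true" primaryKey="store_id">
        <Table name="store"/>
        <Level name="State" column="state"/>
        <Level name="City" column="city"/>
      </Hierarchy>
    </Dimension>
    <Dimension name="Time" foreignKey="time_id">
      <Hierarchy hasAll="true" primaryKey="time_id">
        <Table name="time_by_day"/>
        <Level name="Year" column="the_year"/>
        <Level name="Month" column="the_month"/>
      </Hierarchy>
    </Dimension>
    <Measure name="Unit Sales" column="unit_sales" aggregator="sum"/>
  </Cube>
</Schema>
"""


def read_schema(xml_text):
    """Return (hierarchies, measures) for the first cube of a schema.

    Simplification: one hierarchy per dimension is assumed, so a later
    hierarchy of the same dimension would overwrite an earlier one.
    """
    cube = ET.fromstring(xml_text).find("Cube")
    hierarchies = {}
    for dim in cube.findall("Dimension"):
        for hier in dim.findall("Hierarchy"):
            # Levels are kept in document order.
            hierarchies[dim.get("name")] = [
                level.get("name") for level in hier.findall("Level")
            ]
    # Each measure carries its own aggregation operator.
    measures = {m.get("name"): m.get("aggregator") for m in cube.findall("Measure")}
    return hierarchies, measures


hierarchies, measures = read_schema(SCHEMA_XML)
```

Here `hierarchies` maps each dimension name to its list of level names, and `measures` maps each measure name to its aggregation operator.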

CubeLoad is written in Java and can be downloaded from here. It can be freely used by researchers, practitioners, and vendors whenever they need to create parametric bulk OLAP workloads for benchmarking and testing.

A functional overview of the CubeLoad architecture is sketched below.

The main input is the multidimensional schema on which the workload is to be generated; it is provided using the XML specification that Mondrian adopts for its metadata. To generate realistic selection predicates and to enable report sizes to be estimated, dimension data are also needed. These data can be fed into CubeLoad in CSV (comma-separated values) format, which benchmarkers can easily obtain by exporting dimension tables. The output of CubeLoad is an OLAP workload, defined as a set of sessions and encoded in XML to maximize interoperability. A session is a sequence of queries. In the current implementation, we support a basic form of multidimensional query consisting of

  1. a group-by (i.e., a set of hierarchy levels on which measure values are grouped);
  2. one or more measures whose values are returned (the aggregation operator used for each measure is defined by the multidimensional schema); and
  3. zero or more selection predicates, each operating on a hierarchy level.
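The three components listed above can be pictured as a small data structure. The sketch below is only an illustration: the class names `Query` and `Predicate` and their fields are hypothetical, and predicates are assumed to be simple equalities on a single member.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    level: str   # hierarchy level the predicate operates on
    member: str  # selected member (equality predicate: an assumption)


@dataclass(frozen=True)
class Query:
    group_by: frozenset     # set of hierarchy levels used for grouping
    measures: tuple         # one or more measure names
    predicates: tuple = ()  # zero or more selection predicates

    def __post_init__(self):
        if not self.measures:
            raise ValueError("a query must return at least one measure")


# A session is a sequence of queries; here, a seed query followed by a sliced one.
seed = Query(frozenset({"State", "Year"}), ("Unit Sales",))
sliced = Query(seed.group_by, seed.measures, (Predicate("Year", "2013"),))
session = [seed, sliced]
```

The aggregation operator is deliberately absent from `Query`, since it is defined by the multidimensional schema rather than by the query.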

The result of a query is called a report; its size is the number of facts returned. Roughly, the size of a report can be estimated as the product of the domain cardinalities of all levels in the query group-by, reduced by the selectivity factors of the selection predicates. Two consecutive queries within a session are normally separated by the application of one OLAP operation that changes the group-by, the selection predicate, or the set of measures returned.
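The rough size estimate described above can be written in a few lines. This is a minimal sketch, not CubeLoad's estimator: the function name `estimate_report_size`, the cardinalities, and the selectivity values are assumptions chosen for the example.

```python
from math import prod


def estimate_report_size(group_by, cardinality, selectivities=()):
    """Product of the domain cardinalities of the group-by levels,
    reduced by the selectivity factor of each selection predicate."""
    size = prod(cardinality[level] for level in group_by)
    for selectivity in selectivities:
        size *= selectivity
    return size


# Hypothetical domain cardinalities, e.g. counted from exported dimension tables.
cardinality = {"State": 50, "Year": 5, "Month": 60}

no_predicate = estimate_report_size(["State", "Year"], cardinality)
# One predicate that keeps a quarter of the facts (e.g. one of four departments).
one_predicate = estimate_report_size(["State", "Month"], cardinality, [0.25])
```

With these numbers, the first estimate is 50 × 5 = 250 facts and the second is 50 × 60 × 0.25 = 750 facts. The estimate is an upper bound in practice, since not every combination of members necessarily has a fact.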

Each session generated by CubeLoad for a given profile starts from one of the seed queries for that profile and evolves, consistently with global and profile parameters, according to a template. In its current implementation, CubeLoad uses four different templates for generating sessions:

  1. Slice-and-Drill. In several OLAP front-ends, the default behavior when a user clicks on a row/column of a pivot table is to disaggregate the values for that row/column into its components, which in OLAP terms means slicing and drilling down. For instance, starting from a report showing sales per state and year, clicking on 2013 would trigger a query showing sales per state and month of 2013, while clicking on Florida would trigger a query showing sales per city of Florida and year. In sessions based on this template, hierarchies are progressively navigated by choosing a hierarchy h and a member v of the current group-by level l of h, then creating a new query with selection predicate l=v and group-by on the level l' that precedes l within h.
  2. Slice-All. Users are sometimes interested in navigating a cube by slices, i.e., in repeatedly running the same query but with different selection predicates. In sessions based on this template, a level l of the group-by of the seed query is chosen, and new queries are generated by keeping the same group-by and adding selection predicates on the different members of l. For instance, starting from a query asking for the monthly sales by state for the video department, the subsequent queries could ask for the same report for the audio, the photo, and the PC departments.
  3. Explorative. Some queries may return reports that are particularly interesting for most users, for instance because they show unexpected results (e.g., they show that the impact of a social policy is not the one that had been predicted) or have a strong impact on business (e.g., they show that the level of qualified employment in a given area is extremely low, which requires a corrective action to be taken). We call them surprising queries. The motivation for this template is the assumption that several users, while exploring the cube in search of significant correlations, will be "attracted" by one surprising query. So, sessions based on this template tend to converge "near" one of the surprising queries, then they evolve randomly. Note that the overall number of surprising queries is fixed by a global parameter, while each surprising query is randomly generated.
  4. Goal-Oriented. Sessions of this type are run by users who have a specific analysis goal, but whose OLAP skills are limited, so they may follow a complex path to reach their destination. All the goal-oriented sessions starting from the same seed query q end in the same (randomly generated) query p, but the sequence of OLAP operations to be applied to reach p from q is generated randomly.
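To make the first two templates more concrete, the sketch below generates a Slice-and-Drill session and a Slice-All session on a toy cube. It is a simplified illustration under stated assumptions, not the CubeLoad implementation: the hierarchies, the members, and the function names are hypothetical, levels are assumed to be ordered from finest to coarsest (so the level that precedes l is the next finer one), and members are picked without checking consistency with earlier predicates.

```python
import random

# Hypothetical hierarchies, each ordered from finest to coarsest level.
HIERARCHIES = {"Store": ["City", "State"], "Time": ["Month", "Year"]}
# Hypothetical members per level (in practice taken from the dimension data).
MEMBERS = {
    "State": ["Florida", "Texas"],
    "City": ["Miami", "Austin"],
    "Year": ["2012", "2013"],
    "Month": ["2013-01", "2013-02"],
}


def slice_and_drill(group_by, rng):
    """group_by maps each hierarchy to its current level.
    Returns a list of (group_by, predicates) pairs, one per query."""
    group_by, predicates = dict(group_by), []
    session = [(dict(group_by), list(predicates))]
    while True:
        # Hierarchies whose current level still has a finer level below it.
        drillable = sorted(
            h for h, l in group_by.items() if HIERARCHIES[h].index(l) > 0
        )
        if not drillable:
            return session
        h = rng.choice(drillable)
        l = group_by[h]
        v = rng.choice(MEMBERS[l])
        predicates.append((l, v))                                   # slice on l = v
        group_by[h] = HIERARCHIES[h][HIERARCHIES[h].index(l) - 1]   # drill to l'
        session.append((dict(group_by), list(predicates)))


def slice_all(group_by, level):
    """Same group-by throughout; one new query per member of the chosen level."""
    seed = (dict(group_by), [])
    return [seed] + [(dict(group_by), [(level, v)]) for v in MEMBERS[level]]


drill_session = slice_and_drill({"Store": "State", "Time": "Year"}, random.Random(0))
slice_session = slice_all({"Store": "State", "Time": "Month"}, "State")
```

Starting from sales per state and year, `drill_session` contains three queries and ends on a group-by of city and month with two accumulated predicates; which hierarchy is drilled first depends on the random generator. `slice_session` contains the seed query followed by one query per state.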

For further details, please read the CAISE 2014 paper.