Proteus is a database engine designed for today's heterogeneous environments. Proteus adapts to variable data, hardware, and workloads through a combination of GPU acceleration, data virtualization, and adaptive scheduling.

Fast Queries Over Heterogeneous Data Through Engine Customization

VLDB 2016. M. Karpathiotakis, I. Alagiannis, A. Ailamaki


Industry and academia are continuously becoming more data-driven and data-intensive, relying on the analysis of a wide variety of heterogeneous datasets to gain insights. The different data models and formats pose a significant challenge to performing analysis over a combination of diverse datasets. Serving all queries using a single, general-purpose query engine is slow. On the other hand, using a specialized engine for each heterogeneous dataset increases complexity: queries touching a combination of datasets require an integration layer over the different engines.

This paper presents a system design that natively supports heterogeneous data formats and also minimizes query execution times. For multi-format support, the design uses an expressive query algebra which enables operations over various data models. For minimal execution times, it uses a code generation mechanism to mimic the system and storage most appropriate to answer a query fast. We validate our design by building Proteus, a query engine which natively supports queries over CSV, JSON, and relational binary data, and which specializes itself to each query, dataset, and workload via code generation. Proteus outperforms state-of-the-art open-source and commercial systems on both synthetic and real-world workloads without being tied to a single data model or format, all while exposing users to a single query interface.
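To make the idea of per-query, per-format specialization concrete, here is a minimal Python sketch. It is not Proteus's code generator: the function name `generate_sum_query`, the two format plug-ins, and the sample data are all hypothetical, and the query shape (a filtered sum over one field) is fixed for brevity. The sketch only illustrates the general technique of emitting source code tailored to one query and one input format, then compiling it, instead of interpreting a generic plan.

```python
import csv
import io
import json


def generate_sum_query(fmt, field, threshold):
    """Emit and compile Python source for:
    SELECT SUM(field) FROM input WHERE field > threshold
    specialized to a single input format (hypothetical helper)."""
    if fmt == "csv":
        # CSV plug-in: values arrive as strings, so the cast is baked in.
        scan = "rows = csv.DictReader(io.StringIO(text))"
        access = f"float(row[{field!r}])"
    elif fmt == "json":
        # JSON-lines plug-in: values are already typed, so no cast is emitted.
        scan = "rows = (json.loads(line) for line in text.splitlines() if line)"
        access = f"row[{field!r}]"
    else:
        raise ValueError(f"unsupported format: {fmt}")

    # The field name and threshold are inlined as constants in the source.
    source = (
        "def query(text):\n"
        f"    {scan}\n"
        "    total = 0.0\n"
        "    for row in rows:\n"
        f"        value = {access}\n"
        f"        if value > {threshold!r}:\n"
        "            total += value\n"
        "    return total\n"
    )
    namespace = {"csv": csv, "io": io, "json": json}
    exec(source, namespace)  # compile the generated, specialized function
    return namespace["query"], source


# Hypothetical sample data: the same three records in two formats.
csv_text = "id,price\n1,5.0\n2,12.5\n3,20.0\n"
json_text = (
    '{"id": 1, "price": 5.0}\n'
    '{"id": 2, "price": 12.5}\n'
    '{"id": 3, "price": 20.0}\n'
)

csv_query, csv_source = generate_sum_query("csv", "price", 10)
json_query, json_source = generate_sum_query("json", "price", 10)

print(csv_query(csv_text))    # 32.5
print(json_query(json_text))  # 32.5
```

Both generated functions answer the same logical query through one interface, yet each contains only the access path its format needs: the CSV variant carries a string-to-float cast, while the JSON variant reads typed values directly.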

  author    = {Manos Karpathiotakis and
               Ioannis Alagiannis and
               Anastasia Ailamaki},
  title     = {Fast Queries Over Heterogeneous Data Through Engine Customization},
  journal   = {Proc. {VLDB} Endow.},
  volume    = {9},
  number    = {12},
  pages     = {972--983},
  year      = {2016},
  url       = {},
  doi       = {10.14778/2994509.2994516},
  timestamp = {Sat, 25 Apr 2020 13:58:55 +0200},
  biburl    = {},
  bibsource = {dblp computer science bibliography}