HeroDB

HeroDB is a conceptual database management system that is aware of the underlying hardware. The complete DBMS is intended to support efficient processing over an arbitrary heterogeneous hardware environment.

It is an architecture for a DBMS over heterogeneous computing resources. The system uses a decoupled, layered design in which the lower-level devices are abstracted away from the higher-level applications. Execution trickles down across the layers, and each layer performs its own optimizations so that the overall execution is efficient.

The overall architecture envisioned

HeroDB_arch

Components

There are two main layers for execution:

  1. Data-Processing
  2. Task-Based Runtime

Data Processing - Layer 2

This layer performs logical operator placement and device selection on an abstracted view of the hardware. The SQL query (given as a primitive query graph) is transformed into a primitive-execution sequence based on the primitives available in layer 1 (given as task families). Once the execution-ready primitives are decided, they are optimized for functional or data-parallel execution. After optimization, we obtain a complete execution plan for the query (primitive query graph). Each node (primitive) in the graph is finally bundled with execution-ready information (data names, parameter information, tagged device) and forwarded to layer 1. Based on the information above and the execution steps, the modules available within the data processing layer are shown below, followed by a sketch of the overall planning flow.

Data_processing_layer
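As a rough illustration, a hedged, high-level sketch of this flow is given below; all type and function names (plan_query, BundledTask, etc.) are illustrative placeholders and not part of the actual code base.

```cpp
// High-level sketch of the layer-2 flow: a primitive query graph is transformed
// against the available task families, optimized for parallel execution, bundled
// with execution-ready information and handed to layer 1.
#include <string>
#include <vector>

struct PrimitiveQueryGraph { /* SQL given as a primitive query graph */ };
struct TaskGraph           { /* execution-ready primitive sequence   */ };

struct BundledTask {                        // one node forwarded to layer 1
    std::string              task_name;
    std::vector<std::string> data_names;    // input data identifiers
    std::vector<double>      parameters;    // parameter information
    int                      tagged_device; // device chosen for execution
};

TaskGraph transform(const PrimitiveQueryGraph&) { return {}; }    // match task families
TaskGraph optimize(const TaskGraph&)            { return {}; }    // functional / data parallelism
std::vector<BundledTask> bundle(const TaskGraph&) { return {}; }  // attach execution-ready info

void plan_query(const PrimitiveQueryGraph& pqg) {
    std::vector<BundledTask> bundles = bundle(optimize(transform(pqg)));
    (void)bundles;  // forwarded to layer 1 via the task execution interface (TEI)
}
```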

Task-Based Runtime - Layer 1

In the task-based runtime, each bundled task given by layer 2 is executed. Given a task, the executor takes care of data transfer (if any), result-set generation and setting up of the operation. Once these preliminaries are done, the layer allocates resources for execution and finally executes the operation.

The layer also provides minimal load balancing for execution. It can, for example, defer the execution of an operation when another operation is already occupying the device. The deferred operation can then be executed on another device, or on the same device at a later time.

Overall, layer 1 provides the runtime - managing the heterogeneous devices, their memories and IO - that executes any given task. Here, a task can be any finite data-processing computation performed on any processing resource. A task is always part of a task family.

Component: Execute

The execute component takes care of handling OpenCL for data processing. The component is responsible for:

  1. Setting-up and monitoring executable devices
  2. Queueing data and handling data transfers
  3. Monitoring operation performance

This is one of the components called during initial system setup. The component should have an initialize() function that senses all available devices and sets up their respective OpenCL contexts and command queues. This component also initializes the monitoring information for each of the devices (a minimal sketch of such an initialize() follows the list below). Currently, the following details of a device are monitored:

  1. Available memory space left
  2. No. of parallel work items available
  3. Number of currently running jobs
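Below is a minimal sketch of what such an initialize() could look like using the standard OpenCL C API; the DeviceState bookkeeping struct and its field names are assumptions for illustration.

```cpp
// Discover all OpenCL devices and set up a context, a command queue and
// monitoring data per device.
#include <CL/cl.h>
#include <vector>

struct DeviceState {
    cl_device_id     device;
    cl_context       context;
    cl_command_queue queue;
    cl_ulong         free_memory;     // available memory space left
    cl_uint          compute_units;   // proxy for parallel work items available
    int              running_jobs;    // number of currently running jobs
};

std::vector<DeviceState> initialize() {
    std::vector<DeviceState> devices;
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

    for (cl_platform_id platform : platforms) {
        cl_uint num_devices = 0;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices);
        std::vector<cl_device_id> ids(num_devices);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, num_devices, ids.data(), nullptr);

        for (cl_device_id id : ids) {
            DeviceState d{};
            d.device  = id;
            d.context = clCreateContext(nullptr, 1, &id, nullptr, nullptr, nullptr);
            d.queue   = clCreateCommandQueueWithProperties(d.context, id, nullptr, nullptr);
            clGetDeviceInfo(id, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(d.free_memory),
                            &d.free_memory, nullptr);
            clGetDeviceInfo(id, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(d.compute_units),
                            &d.compute_units, nullptr);
            d.running_jobs = 0;
            devices.push_back(d);
        }
    }
    return devices;
}
```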

The component also provides an execute() function, which is called to execute an operation on a given device. execute() requires:

  1. Device Queue to execute

Since it is highly beneficial to process a series of tasks packed into a device's queue, and since OpenCL allows lazy evaluation, where enqueued tasks are only forced to complete when clFinish() is called, the execute function takes the device queue rather than individual tasks. To add a task to the device queue, we require another function - QueueTask() - taking the following parameters (a sketch of the two functions follows the list):

  1. Device ID (as given by OpenCL)
  2. Data information
  3. Task information
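A minimal sketch of how QueueTask() and execute() could relate is shown below; it reuses the hypothetical DeviceState struct from the initialize() sketch above, and the DataInfo/TaskInfo shapes are assumptions. QueueTask() only enqueues work, while execute() drains the whole device queue with clFinish().

```cpp
#include <CL/cl.h>
#include <cstddef>

struct DataInfo { cl_mem buffer; size_t num_elements; };   // illustrative
struct TaskInfo { cl_kernel kernel; size_t global_size; }; // illustrative

void QueueTask(DeviceState& dev, const DataInfo& data, const TaskInfo& task) {
    // Bind the input buffer and its element count to the kernel, then enqueue it.
    clSetKernelArg(task.kernel, 0, sizeof(cl_mem), &data.buffer);
    cl_uint n = static_cast<cl_uint>(data.num_elements);
    clSetKernelArg(task.kernel, 1, sizeof(cl_uint), &n);
    clEnqueueNDRangeKernel(dev.queue, task.kernel, 1, nullptr,
                           &task.global_size, nullptr, 0, nullptr, nullptr);
    dev.running_jobs++;
}

void execute(DeviceState& dev) {
    // Drain the device queue: all previously enqueued tasks must complete here.
    clFinish(dev.queue);
    dev.running_jobs = 0;
}
```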

Furthermore, the data and task information each bundle a set of details.

For Data information, we currently require:

  1. ID
  2. Current Device Location

Note: the data referenced in the data information must already be available to OpenCL (i.e. as a device buffer), NOT merely as a C++ object in host memory.

To manipulate data on any device, we provide a minimal set of API functions (a usage sketch follows the list):

  a. add_data(ID, *ptr, device, size)
  b. get_data(ID, device, size)
  c. print_data(ID, device, size, print_binary = false)
  d. transfer_data(ID, src_device, trgt_device, size)

and so on.
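A hedged usage sketch of this data API is shown below; the string IDs, the device indices and the element type are assumptions chosen only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Declarations matching the documented API; the string-ID and element-count
// conventions are assumptions.
void add_data(const std::string& id, void* ptr, int device, size_t size);
void get_data(const std::string& id, int device, size_t size);
void transfer_data(const std::string& id, int src_device, int trgt_device, size_t size);

void advertise_column_example() {
    std::vector<int32_t> column_a = {5, 120, 42, 999};
    const int CPU = 0, GPU = 1;                             // assumed device indices

    add_data("A", column_a.data(), CPU, column_a.size());   // host pointer -> OpenCL buffer
    transfer_data("A", CPU, GPU, column_a.size());          // move the buffer to the GPU
    // ... queue and execute tasks that consume "A" on the GPU ...
    get_data("A", GPU, column_a.size());                    // pull the data back afterwards
}
```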

Similarly, a task must also first be made available to OpenCL by compiling its CL function string. A similar set of API functions is available for kernel preparation (a sketch of a possible add_kernel() follows the list):

  a. add_kernel(kernel_name, device, kernel_string)
  b. <>
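As a hedged sketch, add_kernel() could be a thin wrapper over the standard OpenCL compilation calls; the kernel registry is an assumption, and the OpenCL context is passed explicitly here, whereas the documented signature only takes the device.

```cpp
#include <CL/cl.h>
#include <map>
#include <string>

std::map<std::string, cl_kernel> kernel_registry;   // assumed global kernel cache

void add_kernel(const std::string& kernel_name, cl_context context,
                cl_device_id device, const std::string& kernel_string) {
    const char* src = kernel_string.c_str();
    size_t len = kernel_string.size();
    // Compile the CL function string for the target device ...
    cl_program program = clCreateProgramWithSource(context, 1, &src, &len, nullptr);
    clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
    // ... and cache the resulting kernel under its name.
    kernel_registry[kernel_name] = clCreateKernel(program, kernel_name.c_str(), nullptr);
}
```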

Any component that wants to execute a task must first advertise its data and operation to OpenCL using these APIs, and then call the execute function with the task and data bundle.

The complete execution flow of the 'execute' component is shown here:

execute_operation_flow

The components of the task-based runtime are:

  1. Task manager
  2. Object manager

Task manager

The task manager aids in scheduling and placement decisions for a given primitive.

Object manager

The object manager tracks all available resources and provides cost models for data transfer and for the expected execution time on a given resource.

Another major task of the layer is lightweight, real-time load balancing.

Major functions

Load balance: over-allocating tasks to a single device leads to its over-utilization. As a consequence, the operators in its waiting queue cannot be processed within the estimated time, while another device may be free but remains idle because its queue is not populated.

Hence, when the given device is busy, we have two options to choose from (a decision sketch follows the list):

  1. wait until the device is ready to process the operators
  2. execute on another device, even if it is less efficient for the given operator.
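A minimal decision sketch is given below, assuming the cost figures come from the object manager's cost models; the struct and function names are illustrative.

```cpp
// Choose between waiting for the preferred device and moving to an alternative.
struct PlacementEstimate {
    double wait_time;        // time until the preferred device becomes free
    double exec_preferred;   // execution time on the preferred device
    double transfer_time;    // cost of moving the data to the alternative device
    double exec_alternative; // execution time on the (less efficient) alternative
};

enum class Decision { WaitOnPreferred, MoveToAlternative };

Decision balance(const PlacementEstimate& e) {
    double stay = e.wait_time + e.exec_preferred;
    double move = e.transfer_time + e.exec_alternative;
    return (stay <= move) ? Decision::WaitOnPreferred : Decision::MoveToAlternative;
}
```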

Optimization: although the upper level provides the details of the optimized placement, the amount of resources to allocate still has to be identified. In this step, the optimizer of each device selects the optimal resource set for executing the task.

Synchronization: this takes care of data dependencies and transfers across devices.

Adaptation: dynamic planning of resource allocation. This loops continuously, checking whether any new resource has been added or an existing resource has been removed. It interacts tightly with layer 0 (not in the wiki as of now).

Interface to other layers

Since the layers are compartmentalized, the interaction between the layers is done using interfaces. Layer 0 provides device information and layer 2 provides execution object information. These are consumed using the below-listed interfaces.

Task Execution Interface (TEI)

This is the interface via which the data processing layer provides input to layer 1. The interface consumes a task bundle graph, whose tasks are bundled with the following (a sketch of such a bundle follows the list):

  1. static attributes (parameters)
  2. required device
  3. expected execution cost estimate
  4. IO references (where the input data resides)
  5. sync dependencies (if the operator requires the result of another operator)
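A hedged sketch of what one node of such a task bundle graph could look like; the field names and types are assumptions that mirror the list above.

```cpp
#include <string>
#include <vector>

struct TaskBundle {
    std::vector<double>      static_attributes;  // operator parameters
    int                      required_device;    // device chosen by layer 2
    double                   cost_estimate;      // expected execution cost
    std::vector<std::string> io_references;      // where the input data resides
    std::vector<int>         sync_dependencies;  // operators whose results this task waits for
};

// The TEI itself would then simply accept the whole graph, e.g.:
// void submit(const std::vector<TaskBundle>& task_bundle_graph);
```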

Load Introspection Interface (LII)

Provides information about the current availability of physical resources.

Sample use-case

Performing a conjunctive selection across devices: (A < 100) && (B > 100), sketched below.
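One way these primitives could look as OpenCL kernels is sketched below: each selection produces a bitmap, and the conjunction is a bitwise AND of the two bitmaps. The kernel names and the bitmap representation are illustrative assumptions.

```cpp
#include <string>

const std::string selection_kernels = R"CL(
__kernel void select_lt(__global const int* col, int literal, __global char* bitmap) {
    size_t i = get_global_id(0);
    bitmap[i] = (col[i] < literal) ? 1 : 0;      // A < 100
}
__kernel void select_gt(__global const int* col, int literal, __global char* bitmap) {
    size_t i = get_global_id(0);
    bitmap[i] = (col[i] > literal) ? 1 : 0;      // B > 100
}
__kernel void bitmap_and(__global const char* a, __global const char* b,
                         __global char* out) {
    size_t i = get_global_id(0);
    out[i] = a[i] & b[i];                        // (A < 100) && (B > 100)
}
)CL";
```

Layer 2 could, for example, place select_lt and select_gt on different devices and merge the two bitmaps with bitmap_and on whichever device holds both intermediate results.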

Layer 2

Pre-processing: get available device information from layer 1 (object manager).

Planning - steps

Given a query, the planner provides a query execution plan for the next level.

Transformation - steps

Search the PQG (Primitive Query Graph) for structural patterns and apply graph transformations.

Possible graph transformations (a fusion sketch follows below):

  1. replacing a single task node with a sequence of tasks, e.g. an aggregate node = map -> prefix sum -> aggregate
  2. replacing a series of tasks with a single task node, e.g. map(+) -> map(*) = map(+, *)

return: task graph
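A small sketch of the second transformation, fusing consecutive map nodes into one; the task-graph representation is an assumption chosen for illustration.

```cpp
#include <string>
#include <vector>

struct TaskNode {
    std::string primitive;               // e.g. "map"
    std::vector<std::string> operations; // e.g. {"+"} or {"+", "*"}
};

// Fuse adjacent map nodes in a linear task sequence: map(+) -> map(*) becomes map(+, *).
std::vector<TaskNode> fuse_maps(const std::vector<TaskNode>& tasks) {
    std::vector<TaskNode> fused;
    for (const TaskNode& t : tasks) {
        if (!fused.empty() && fused.back().primitive == "map" && t.primitive == "map") {
            fused.back().operations.insert(fused.back().operations.end(),
                                           t.operations.begin(), t.operations.end());
        } else {
            fused.push_back(t);
        }
    }
    return fused;
}
```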

Optimization - steps

Selection of appropriate device and task implementation for execution.

Optimization is done for:

  1. less resource consumption
  2. execution costs - execution time + data transfer time + waiting time
  3. parallelism - data and functional parallelism

Here, data parallelism means that a single operation is executed across multiple devices.

Execution Plan

Once the planning decisions are made, the query plan is annotated with execution-related information:

  1. execution task source
  2. data source
  3. static parameters
  4. runtime parameters

It might happen that the input does not fit completely into a device. In this case, the execution task is bundled with a chunk of the input data, and the same task is spawned again in the pipeline with the next chunk (see the sketch below).
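A hedged sketch of this chunking step; the names and types are illustrative only.

```cpp
// When a column does not fit into device memory, the same task is spawned once
// per chunk, each bundle carrying only the offset and length of its chunk.
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct ChunkedTask {
    std::string task;      // task to execute, e.g. a selection
    std::string data_id;   // column identifier
    size_t      offset;    // first element of this chunk
    size_t      length;    // number of elements in this chunk
};

std::vector<ChunkedTask> chunk_task(const std::string& task, const std::string& data_id,
                                    size_t num_elements, size_t chunk_size) {
    std::vector<ChunkedTask> pipeline;
    for (size_t offset = 0; offset < num_elements; offset += chunk_size) {
        size_t length = std::min(chunk_size, num_elements - offset);
        pipeline.push_back({task, data_id, offset, length});
    }
    return pipeline;
}
```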

This allocated, bundled task graph is forwarded to the next layer (the task execution layer) using the TEI interface.

Other general information

During initialization, the following information must be stored in a lookup table (a possible layout is sketched after the list):

  1. operations / primitives available for execution
  2. Data / table values available to be queried
  3. Device / execution related information
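A possible layout for these lookup tables is sketched below; the key and value types are placeholders chosen only for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

struct PrimitiveInfo { std::string kernel_source; };   // operations / primitives
struct TableInfo     { size_t num_rows; };             // data / table values
struct DeviceInfo    { size_t global_memory; };        // device / execution info

struct Catalog {
    std::map<std::string, PrimitiveInfo> primitives; // e.g. a selection primitive
    std::map<std::string, TableInfo>     tables;     // queryable tables / columns
    std::map<std::string, DeviceInfo>    devices;    // available devices
};
```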

Tasks and task family

A task family captures the general information for a given DBMS primitive. A family holds the general characteristics of the operation, and each task in the family contains the exact execution-related information.

Currently, all the information in the task family is captured for a single runtime environment (i.e. OpenCL), but this can easily be extended to any arbitrary execution environment (C++, CUDA, etc.).

These are wrappers that provide progressively more information at each lower layer. The terms used for them are:

  1. Task Family
  2. Task
  3. Kernel bundle

Due to the declarative format of SQL, the end-user is exposed only to the primitive names that are to be used in the query execution plan. The user formulates the query execution plan as a DFG (data flow graph) with these exposed primitives as the nodes and the input columns on the first incoming edges. We call this input graph from the user the Query Execution Graph (QEG).

Note: we expect the formulated QEG to be semantically correct, i.e. the dependencies of each node are met by the nodes feeding into it.

Task Family: this bundles all the possible tasks that semantically provide the same operation. These tasks consume a given set of inputs and produce a pre-defined set of results. However, each task within the family can be differentiated based on:

  • Implementation
    • Algorithm
    • Code structure
  • Target device

It is also possible that a single implementation is used on multiple devices, and vice versa. Hence, a task is identified by the combination of its implementation and its target device (see the sketch below).

Task_family
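As a small illustration of this identification scheme (the types are assumptions for illustration only):

```cpp
#include <map>
#include <string>
#include <utility>

struct Task;  // concrete tasks, as described in the next section

// A task family maps (implementation name, device id) -> concrete task.
using TaskFamily = std::map<std::pair<std::string, int>, Task*>;
```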

Task: as previously shown, a task captures a single implementation for a given target device. Since each task differs from the others in its family, pre-processing is necessary to prepare the required intermediate data structures and the intermediate result set before execution commences. Consequently, we also require post-processing to clear the data structures that are no longer necessary. For this reason, all tasks in a task family extend an abstract class 'AbstractTask' that defines the template for any executable task (a sketch follows the lists below). The class encapsulates:

Properties:
  • Input definition: first, a set of predefined (TBD) input arguments that define specific properties of the incoming nodes. This also helps in verifying that the input of a given task has the required properties.

  • Parameter information: the default ones are the parameters required for processing the function itself. In addition, there can be many user-defined parameters that help in pre-processing the input.

  • Output definition: each node in the QEG forwards a set of data as its result; these are defined here. As an exception, only the end node, which prints the results to the user, has no output definition.

Furthermore, the input and output definitions are overloaded with a set of properties necessary for preprocessing the data directly or setting up the kernels indirectly.

Abstract functions:

  • construct() - here, preprocessing steps such as setting up data structures and defining the number of parallel work-items to spawn are performed.

  • evaluate() - the execution call for the node is issued here.

  • destruct() - clean-up tasks are done here.
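A hedged sketch of what 'AbstractTask' could look like; the member names for the input, parameter and output definitions are assumptions.

```cpp
#include <map>
#include <string>
#include <vector>

class AbstractTask {
public:
    virtual ~AbstractTask() = default;

    // Pre-processing: set up intermediate data structures and decide how many
    // parallel work-items to spawn.
    virtual void construct() = 0;

    // Execution call for the node.
    virtual void evaluate() = 0;

    // Post-processing: clear data structures that are no longer necessary.
    virtual void destruct() = 0;

protected:
    std::vector<std::string>           input_definition;   // expected incoming data
    std::map<std::string, std::string> parameters;         // default + user-defined
    std::vector<std::string>           output_definition;  // data forwarded as result
};

// A concrete task, e.g. a GPU selection, would derive from AbstractTask and
// override the three functions for its specific implementation and device.
```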

Device Execution Layer - Layer 0

This layer only deals with executing the given operation and data set. It runs until completion, or aborts the execution when a runtime error occurs. The layer does not perform any optimization; its major job is to make the given tasks ready for execution. The main functions of the layer are:

  1. compiling and installing the kernel on the device
  2. adding data to the device, either by replacing pre-existing data or by creating a new memory space and copying the data into it
  3. executing tasks or queues of tasks, returning events or other captured properties for the given tasks

Functionalities provided

Environment: a helper that identifies the different OpenCL-based devices and sets up their respective platforms and contexts.

Runtime: contains the execution and monitoring functions. These are called only after both the executable kernel and the data have been pushed to the device.

DataMapper: provides different data-related functions. These can push, delete or pull data from the device.

KernelMapper: provides different kernel-related functions. These can compile the given operation for a device or delete it from the device (a sketch of the four module interfaces follows).
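The four modules could be thin interfaces over OpenCL, as sketched below; all signatures are assumptions that mirror the descriptions above.

```cpp
#include <CL/cl.h>
#include <cstddef>
#include <string>
#include <vector>

struct Environment {      // discover devices, set up platform and context
    std::vector<cl_device_id> devices;
    cl_context context;
    void initialize();
};

struct DataMapper {       // push, pull and delete data on a device
    cl_mem push(const void* host_ptr, size_t bytes);
    void   pull(cl_mem buffer, void* host_ptr, size_t bytes);
    void   remove(cl_mem buffer);
};

struct KernelMapper {     // compile and delete operations on a device
    cl_kernel compile(const std::string& name, const std::string& source);
    void      remove(cl_kernel kernel);
};

struct Runtime {          // execute and monitor after kernel and data are in place
    cl_event execute(cl_command_queue queue, cl_kernel kernel, size_t global_size);
};
```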

Properties provided

Along with the executables, the layer also exposes several property values. The layer provides an identifier for selecting each of the following OpenCL properties:

  • platform
  • device
  • command queue
  • data identifier
  • kernel identifier

FPGA

The FPGA will provide functions for data access:

  • readDataBuffer
  • writeDataBuffer

Data and functions are given as data flow graphs.