Blending Efficient Ingestion and Querying

Motivation: Reduce the intake time in financial/insurance data pipelines

The ability to make informed decisions has become a differentiating asset for an organization's success. Companies must collect high volumes of data from several sources and process it to extract information. This is especially important for departments such as actuarial, risk control, or debt recovery, which need this information to perform their regular activities. Thus, data-driven organizations require data stacks that deliver the results of heavy data processing within a limited time window.

However, this creates extra pressure on databases, which must reconcile two opposing features: fast data ingestion and efficient querying. This dichotomy splits the database spectrum into two categories, NoSQL and SQL:

  • One advantage of NoSQL technology, and more particularly of key-value data stores, is that it is incredibly efficient at ingesting data. However, queries are highly inefficient.
  • SQL databases are the opposite: They are very efficient at querying data, but ingest data very inefficiently, and thus slowly.

In a typical financial company, most of these processes run on SQL databases, mainly for historical reasons. The relational database's inefficiency at ingesting data greatly increases the processing time. Other departments need the result of this processing within a fixed time window, so if the processing time is too long, any unexpected problem means the process does not finish within the window. This creates an organizational risk: the other departments cannot perform their regular tasks properly.

How and why can technical architects accelerate data ingestion with LeanXcale? Let’s examine it in detail.

WHY SQL DATABASES CANNOT PERFORM QUICK DATA INTAKE

Regular SQL databases use a B+ tree as their underlying data structure. This data structure allows SQL databases to query efficiently, but it is inefficient at data intake. Why? Consider that a typical financial data process handles datasets that cannot be kept in memory, for instance a 1 TB table. The B+ tree would grow to six levels to store all the data. If the database runs on a node with 128 GB of memory, it will hold less than 25% of the data in its cache, typically the root and the levels close to it (see Figure 1).

Figure 1: B+ Tree and Associated Block Cache

In this example, every insertion must read an average of three data blocks from persistent storage to reach the target leaf node. Those reads evict the same number of blocks from the cache, causing an average of three blocks to be written back to persistent storage. In other words, inserting a single row costs around six I/Os. This is why SQL databases ingest data so slowly.
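To make these numbers concrete, here is a small back-of-the-envelope calculation in Python. The block size, row size, and fanout are illustrative assumptions (not the internals of any specific engine), chosen to match the 1 TB / 128 GB example above:

```python
import math

# Back-of-the-envelope estimate for the 1 TB example above.
# Block size, row size, and fanout are illustrative assumptions.
BLOCK_SIZE = 8 * 1024         # 8 KB blocks (assumption)
ROW_SIZE   = 80               # 80-byte rows (assumption)
FANOUT     = 100              # children per inner node (assumption)
TABLE_SIZE = 1 * 1024**4      # 1 TB table, as in the example
CACHE_SIZE = 128 * 1024**3    # 128 GB block cache, as in the example

rows   = TABLE_SIZE // ROW_SIZE
leaves = math.ceil(rows / (BLOCK_SIZE // ROW_SIZE))
levels = 1 + math.ceil(math.log(leaves, FANOUT))   # leaf level + inner levels
cached = CACHE_SIZE / TABLE_SIZE                   # fraction of blocks that fit in RAM

# With only the root and the levels close to it cached, a random insertion
# misses on the lower levels: ~3 block reads plus ~3 evicted blocks written back.
ios_per_row = 3 + 3

print(f"tree levels: {levels}")             # 6 for these assumptions
print(f"data cached: {cached:.1%}")         # 12.5%, well under 25%
print(f"I/Os per inserted row: {ios_per_row}")
```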

WHY NOT USE ONLY A NOSQL DATASTORE?

If we only use a NoSQL data store to perform this type of process, we will get an efficient intake, but poor overall performance. Let’s see why.

Regular key-value data stores rely on SSTables: updates are cached in memory and periodically written to disk in a new file, which optimizes I/O by persisting many rows in a single operation. However, queries become very slow. Let's jump into the details.

Figure 2: SSTable Approach Used by NoSQL Data Stores

With SSTables, a given horizontal fragment of the data ends up spread across many files, and the data store has to access all of them to execute a range query. This spread makes reads very inefficient.
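To illustrate the trade-off, here is a minimal sketch of a generic LSM/SSTable-style store in Python. It is not the code of any particular product: `TinyLSM`, its flush threshold, and the JSON file format are simplifications made up for this example. Writes only touch memory until a whole batch is flushed sequentially, while a range query has to open every SSTable file:

```python
import json
from pathlib import Path

class TinyLSM:
    """Minimal sketch of an LSM/SSTable write path (illustrative, not a real engine)."""

    def __init__(self, data_dir: str, flush_threshold: int = 10_000):
        self.memtable = {}                       # in-memory write buffer
        self.flush_threshold = flush_threshold
        self.data_dir = Path(data_dir)
        self.data_dir.mkdir(parents=True, exist_ok=True)
        self.files_written = 0

    def put(self, key, value):
        self.memtable[key] = value               # no disk I/O on the write path
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One sequential write persists many rows at once: this is why intake is fast.
        path = self.data_dir / f"sstable_{self.files_written:05d}.json"
        path.write_text(json.dumps(sorted(self.memtable.items())))
        self.files_written += 1
        self.memtable.clear()

    def range_query(self, lo, hi):
        # The same key range is spread over every SSTable file, so the query
        # has to open and scan all of them: this is why reads are slow.
        result = {}
        for path in sorted(self.data_dir.glob("sstable_*.json")):
            for key, value in json.loads(path.read_text()):
                if lo <= key <= hi:
                    result[key] = value
        for key, value in self.memtable.items():
            if lo <= key <= hi:
                result[key] = value
        return dict(sorted(result.items()))
```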

SSTables improve when the data inside each file is organized as a B+ tree, but the search still has to be performed across several B+ trees.

Let’s assume there are 16 SSTable files, each with its own B+ tree, so the search for the first key of the range must be performed in 16 B+ trees, each 16 times smaller than a single B+ tree holding all the data. Let’s assume each of these B+ trees has 1,024 blocks, so each search needs to access log2(1024) = 10 blocks. Since there are 16 searches, 160 blocks must be read. A single B+ tree would hold 16,384 blocks, and a search would read only log2(16384) = 14 of them. The NoSQL solution reads 160 blocks while SQL reads 14, more than an order of magnitude more.
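The same arithmetic in a few lines of Python, using the text's simplification that a search reads one block per tree level (a base-2 logarithm of the block count):

```python
import math

# The range-query arithmetic above: blocks read to locate the start of a range.
sstables = 16
blocks_per_sstable = 1024
blocks_single_tree = sstables * blocks_per_sstable           # 16,384 blocks in total

nosql_reads = sstables * int(math.log2(blocks_per_sstable))  # 16 * 10 = 160 blocks
sql_reads = int(math.log2(blocks_single_tree))               # 14 blocks

print(nosql_reads, sql_reads)  # 160 vs 14: over an order of magnitude more reads
```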

HOW TO INGEST AND PROCESS QUICKLY? THE LEANXCALE SOLUTION

How does LeanXcale solve the problem? There are several aspects to the answer, but we want to highlight that LeanXcale’s relational key-value datastore is built on a hybrid data structure. This versatile structure is efficient at both ingesting and querying data.

Let’s look at it in detail. This novel data structure is a B+ tree plus two caches, one specialized in writing and one in reading. The write cache stores all row insertions, updates, and deletions, while the read cache is an LRU block cache that holds the most recently read blocks.

Additionally, like a regular SQL database, LeanXcale stores the persistent data in B+ trees (see Figure 3). A B+ tree is an n-ary search tree in which data is stored only in the leaves.

Figure 3: B+ Tree

The goal of the intermediate nodes is to enable the logarithmic search.

As shown in Figure 4, the stored keys are the split points of the data across the child subtrees. When searching for a particular key, sk, a single subtree can be chosen at each node of the tree, since the other subtrees are known not to contain that key. For instance, when sk is higher than k1 but lower than k2, the data can only be in the middle subtree. This descent continues until the leaf node containing the searched row is reached (see Figure 3).

Figure 4: Logarithmic search in a B+ tree

The reason LeanXcale is efficient at querying is that the B+ search tree guarantees that the leaf node(s) containing the targeted rows can be reached by reading only a logarithmic number of blocks.
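The following toy sketch shows the descent that Figure 4 illustrates. The `Node` class and `search` function are illustrative, not LeanXcale code; the point is that each level discards all but one subtree, so the cost is the height of the tree:

```python
import bisect

class Node:
    """B+ tree node: inner nodes hold split keys and children; leaves hold the rows."""
    def __init__(self, keys, children=None, rows=None):
        self.keys = keys            # split points (k1, k2, ... in Figure 4)
        self.children = children    # child subtrees, only on inner nodes
        self.rows = rows            # key -> row mapping, only on leaf nodes

def search(node, sk):
    """Descend from the root, keeping a single subtree per level (logarithmic cost)."""
    while node.children is not None:
        # If k1 <= sk < k2, only the middle subtree can contain sk.
        node = node.children[bisect.bisect_right(node.keys, sk)]
    return node.rows.get(sk)        # leaf reached: only `height` nodes were visited

# Tiny example: a root with one split key (10) and two leaves.
left  = Node(keys=[], rows={5: "row-5", 8: "row-8"})
right = Node(keys=[], rows={12: "row-12"})
root  = Node(keys=[10], children=[left, right])
print(search(root, 8))              # -> "row-8"
```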

LeanXcale uses a write cache to overcome the inefficiency of SQL databases, which require multiple I/Os to insert a single row (see Figure 5).

Figure 5: LeanXcale B+ Tree and Write and Read Caches
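As a rough illustration of how the pieces fit together (a sketch under assumptions, not LeanXcale’s actual implementation; `apply_batch`, `locate_leaf`, and `read_block` are hypothetical B+ tree operations), the write cache absorbs row changes and applies them to the tree in large sorted batches, while an LRU read cache keeps recently read blocks in memory:

```python
from collections import OrderedDict

class HybridStore:
    """Sketch of a B+ tree fronted by a write cache and an LRU read cache."""

    def __init__(self, btree, write_batch=100_000, read_cache_blocks=1_000_000):
        self.btree = btree                # persistent B+ tree, as in Figure 3
        self.write_cache = {}             # pending inserts, updates, and deletes
        self.write_batch = write_batch
        self.read_cache = OrderedDict()   # LRU cache of recently read blocks
        self.read_cache_blocks = read_cache_blocks

    def upsert(self, key, row):
        self.write_cache[key] = row       # absorbed in memory: no I/O per row
        if len(self.write_cache) >= self.write_batch:
            self.flush()

    def delete(self, key):
        self.write_cache[key] = None      # tombstone until the next flush

    def flush(self):
        # A large sorted batch touches each affected leaf once, so the I/O cost
        # is amortized over many rows instead of being paid per row.
        self.btree.apply_batch(sorted(self.write_cache.items()))  # hypothetical API
        self.write_cache.clear()

    def get(self, key):
        if key in self.write_cache:                    # freshest version wins
            return self.write_cache[key]
        block_id = self.btree.locate_leaf(key)         # logarithmic descent (hypothetical)
        block = self.read_cache.get(block_id)
        if block is None:                              # cache miss: one block read
            block = self.btree.read_block(block_id)    # hypothetical API
            self.read_cache[block_id] = block
            if len(self.read_cache) > self.read_cache_blocks:
                self.read_cache.popitem(last=False)    # evict least recently used
        else:
            self.read_cache.move_to_end(block_id)      # refresh LRU position
        return block.get(key)
```

Batching the writes turns the six-I/Os-per-row pattern described earlier into a cost that is amortized over many rows, while queries still enjoy the logarithmic search of the B+ tree.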

WHAT ARE THE OTHER CHARACTERISTICS THAT SPEED UP LEANXCALE DATA INTAKE?

Continue reading on the LeanXcale blog.
