The Data Divide

Growing data volumes have been a given for businesses over the past 20 years, and different technologies have been available to accommodate this growth. In the late 1980s, developers created enterprise business applications to handle growing business activity and, with it, the growing volume of data. These applications used transactions to manage the different business-related tasks. At that time, main memory was very expensive in comparison to disks. Based on the prices of memory and disks, and on the trends in those prices, IT executives followed the “five-minute rule”. The five-minute rule postulated that systems should be equipped with enough main memory to hold all data that would be accessed again by another transaction within five minutes [Gray1985]. This amount of memory seemed to strike a balance between performance and price.
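The trade-off behind the five-minute rule can be made concrete with a little arithmetic. The sketch below uses one commonly quoted formulation of the break-even interval: the cost of keeping a page in memory is weighed against the cost of the disk capacity needed to re-read it. The function name and all input figures are illustrative assumptions, not values taken from this text.

```python
def break_even_interval_seconds(pages_per_mb_ram: float,
                                accesses_per_sec_per_disk: float,
                                price_per_disk: float,
                                price_per_mb_ram: float) -> float:
    """Return the access interval (in seconds) at which caching a page in
    memory costs the same as re-reading it from disk each time.

    Pages re-used more often than this interval are cheaper to keep in memory.
    """
    technology_ratio = pages_per_mb_ram / accesses_per_sec_per_disk
    economic_ratio = price_per_disk / price_per_mb_ram
    return technology_ratio * economic_ratio


# Illustrative (assumed) inputs: 8 KB pages, so 128 pages per MB of memory;
# a disk that serves 64 random accesses per second; 2000 currency units per
# disk; 15 currency units per MB of memory.
interval = break_even_interval_seconds(
    pages_per_mb_ram=128,
    accesses_per_sec_per_disk=64,
    price_per_disk=2000.0,
    price_per_mb_ram=15.0,
)
print(f"Break-even interval: {interval:.0f} s ({interval / 60:.1f} min)")
```

With these assumed prices the break-even interval comes out at a few minutes. As memory becomes cheaper relative to disk accesses, the interval grows and more data is worth keeping in memory.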

Any information that exceeded this limit on main memory would need to be moved from memory to a data store. To store the data from transactions securely, relational databases seemed to be the perfect choice at that time. Database systems were designed to quickly store large amounts of data on a medium cheaper than main memory, namely disks. In this constellation, enterprise resource planning (ERP) applications, which run a continuous stream of business transactions and write data back to databases as it exceeds main memory, can be seen as the “killer application” that enabled the massive success of relational databases. However, once data was stored in a database, it was quite difficult to get back to it. Even a simple query over a large data set in a relational database could easily run for hours and could drain the resources otherwise available to ERP applications.

In this environment, E.F. Codd [Codd1993] suggested a split between transactional and analytic systems: OLTP (online transaction processing) versus OLAP (online analytical processing). Codd set up 12 rules to define the nature of analytic systems, similar to the 12 rules he had postulated to define relational databases eight years earlier. One of the core OLAP rules introduced the need for a “multidimensional conceptual view”. This view is defined by the dimensions that span the space of a particular business problem.

As an example, businesses typically analyze sales according to the time, product, and geographic location of sales. The definition of these multidimensional views exposed the divide between transactional and analytic systems, which is caused by fundamentally different data models: while transactional environments typically run on highly normalized data models to avoid insert anomalies and to save space, analytic environments use de-normalized data models, which provide the flexibility to optimize query performance.
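The contrast between the two data models can be sketched with Python's built-in `sqlite3` module. The table and column names below (`product`, `store`, `sale`, `sales_by_dimension`) and the sample rows are hypothetical and chosen only for illustration. The normalized tables store each fact once, while the de-normalized table copies the time, product, and location dimensions into every row so that an analytic query needs no joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (OLTP-style): each fact is stored once and referenced by key.
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE store   (store_id   INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sale (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES product(product_id),
        store_id   INTEGER REFERENCES store(store_id),
        sold_on    TEXT,
        amount     REAL
    );
    INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO store   VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO sale VALUES
        (1, 1, 1, '2023-01-15', 100.0),
        (2, 1, 2, '2023-02-10', 150.0),
        (3, 2, 1, '2023-02-20', 200.0),
        (4, 1, 1, '2024-01-05',  50.0);

    -- De-normalized (OLAP-style): dimension values are copied into one wide
    -- table, trading redundancy for simpler and faster analytic queries.
    CREATE TABLE sales_by_dimension AS
    SELECT strftime('%Y', s.sold_on) AS year,
           p.name                    AS product,
           st.region                 AS region,
           s.amount                  AS amount
    FROM sale s
    JOIN product p USING (product_id)
    JOIN store st  USING (store_id);
""")

# A multidimensional view: sales totals along time, product, and location.
rows = conn.execute("""
    SELECT year, product, region, SUM(amount)
    FROM sales_by_dimension
    GROUP BY year, product, region
    ORDER BY year, product, region
""").fetchall()

for row in rows:
    print(row)
```

Inserting a new sale into the normalized model touches a single narrow row, whereas the de-normalized table must be rebuilt or refreshed to reflect it. This is the data movement that the rest of this section discusses.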

This divide between OLTP and OLAP raises two substantial issues:

  1. Data needs to be moved.
  2. Inherent to data movement is the need to handle deltas.
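Both issues can be illustrated with a minimal sketch, assuming the transactional data is available as keyed snapshots. The functions `compute_delta` and `apply_delta` and the sample records are hypothetical; real systems typically rely on change logs, timestamps, or triggers rather than full snapshot comparison.

```python
from typing import Any, Dict, Tuple

Snapshot = Dict[int, Any]  # primary key -> record


def compute_delta(previous: Snapshot,
                  current: Snapshot) -> Tuple[Snapshot, Snapshot, Snapshot]:
    """Compare two snapshots and return (inserted, updated, deleted) records."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    updated = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deleted = {k: v for k, v in previous.items() if k not in current}
    return inserted, updated, deleted


def apply_delta(target: Snapshot, inserted: Snapshot,
                updated: Snapshot, deleted: Snapshot) -> None:
    """Bring the analytic copy up to date by moving only the changed records."""
    target.update(inserted)
    target.update(updated)
    for key in deleted:
        target.pop(key, None)


# State of the transactional system at the last load and now.
last_load = {1: ("Widget", 100.0), 2: ("Gadget", 200.0), 3: ("Gizmo", 75.0)}
now = {1: ("Widget", 120.0), 2: ("Gadget", 200.0), 4: ("Doohickey", 30.0)}

# The analytic copy starts out identical to the last load.
analytic_copy = dict(last_load)

inserted, updated, deleted = compute_delta(last_load, now)
apply_delta(analytic_copy, inserted, updated, deleted)
print(inserted, updated, deleted)
```

Until the delta is applied, the analytic copy lags behind the transactional system, so analytic queries see stale data for as long as the move takes.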