Tuesday, April 12, 2011

NoSQL Options in Analytics and Data Warehousing

(Here, data warehousing serves as a guiding example for a generic business application)

Recent years have seen many initiatives to relax ACID properties in order to translate the freedom gained into other benefits such as better performance, scalability or availability. Typically, such approaches result in trade-offs like the ones captured by the CAP theorem. The latter provides a systematic and theoretically proven way to look at one particular trade-off: mostly, sacrificing consistency in order to ensure availability and partition tolerance.

These are purely technical properties that are generic and apply to all applications for which a loss of availability or (network) partitions constitute a huge (economic) risk. However, an application (that runs on top of a high-volume or large-scale DBMS) itself frequently provides a number of opportunities to reduce its reliance on ACID. One particular example is the software that manages tables, data flows, processes, transformations etc. in a data warehouse. One major goal of such software is to expose data that originates from multiple source systems, each of which is presumably consistent in itself, in a way that makes sense to the end user. Here, "making sense" means (a) that the data is harmonized (e.g. by transformations, cleansing etc.) but also (b) that it is plausible*.

What does the latter mean? Consider the following example. Typically, uploads from the various source systems are scheduled independently of each other. Frequently, however, a scenario requires the data from all relevant systems to be completely uploaded before it provides a consistent (plausible) view. A typical example: the total costs of a business process (e.g. a sales process) are only complete (i.e. "consistent" or "plausible") once the costs of all the respective sub-processes (e.g. order + delivery + billing) have been uploaded to the data warehouse. Until then, costs may be overlapping or only partially uploaded. To the end user this looks implausible: depending on the moment when he or she looks at the data, varying amounts for the total costs appear. Normally, such a result is technically correct (i.e. consistent) but not plausible to the end user.
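To make the effect concrete, here is a minimal sketch of the cost example. The sub-process names come from the example above; the cost figures, `SUB_PROCESS_COSTS` and `visible_total` are invented purely for illustration.

```python
# Hypothetical costs per sub-process of one sales process (made-up figures).
SUB_PROCESS_COSTS = {"order": 120.0, "delivery": 45.0, "billing": 15.0}


def visible_total(arrived_loads):
    """Sum the costs of those sub-processes whose loads have already arrived."""
    return sum(SUB_PROCESS_COSTS[name] for name in arrived_loads)


# The loads are scheduled independently, so the warehouse passes through
# several intermediate states before all three have arrived.
snapshots = [
    ["order"],
    ["order", "delivery"],
    ["order", "delivery", "billing"],
]

# One total per point in time at which a user might look at the data.
totals = [visible_total(snapshot) for snapshot in snapshots]
```

Each entry in `totals` is a correct sum of the data present at that moment, yet only the last one is the figure the end user would consider plausible.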

Frequently, such effects are tackled by implementing plausibility gates that allow data to proceed to the next data layer of a data warehouse only if all the other related data has arrived. In other words: plausibility in this context is a certain form of consistency, namely one that, for instance, harmonizes consistency across multiple (source) systems. A plausibility gate is then a kind of managed COMMIT that achieves a plausible state of the data in the next data layer. It is a perfect example of a property or fact that exists within a business application and that can be exploited to manage consistency differently and in a way that performs better. For example: individual bulk loads can be considered part of a wider "(load or data warehouse) transaction" (which comprises all related loads) and need not each be treated in isolation. This is clearly an opportunity to relax some constraints. The source of this relaxation is to be found in the business process and its sub-processes that underlie the various loads.
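A plausibility gate as described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions (one load per source, everything held in memory); the class `PlausibilityGate` and its methods are hypothetical names, not part of any SAP product or API.

```python
from dataclasses import dataclass, field


@dataclass
class PlausibilityGate:
    """Holds back loads until every required source has delivered (hypothetical sketch)."""

    required_sources: frozenset
    staged: dict = field(default_factory=dict)     # source -> rows, not yet visible
    published: list = field(default_factory=list)  # rows visible in the next data layer

    def stage(self, source, rows):
        """Accept one bulk load; return True if this load opened the gate."""
        if source not in self.required_sources:
            raise ValueError(f"unexpected source: {source}")
        self.staged[source] = list(rows)
        return self._try_release()

    def _try_release(self):
        # The "managed COMMIT": publish all related loads together or none at all.
        if set(self.staged) != set(self.required_sources):
            return False
        for source in sorted(self.staged):
            self.published.extend(self.staged[source])
        self.staged.clear()
        return True


gate = PlausibilityGate(frozenset({"order", "delivery", "billing"}))
first = gate.stage("order", [120.0])      # gate stays closed
second = gate.stage("delivery", [45.0])   # still closed
visible_before = list(gate.published)     # nothing is visible downstream yet
third = gate.stage("billing", [15.0])     # last related load opens the gate
```

The individual loads need no isolation guarantees of their own here; only the release step has to behave atomically, which is where the relaxation comes from.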

In the context of SAP's In-Memory Appliance (SAP HANA), SAP is currently investigating such opportunities within its rich suite of business applications. Leveraging them is expected to be one of the major sources of scalability and performance gains beyond the more generic, technology-based opportunities like main memory, multi-core (parallelism) and columnar data structures. We suggest looking into options within the classic world of business applications in order to mimic what has been successfully implemented in non-classic applications, especially in the Internet-scale area, where NoSQL platforms like Hadoop have been successfully adopted.

PS (27 May 2011): By coincidence, I've come across this article, which describes the consistency problem within a sales process. One example is that customers expect to be treated the same regardless of which sales channel they have used, such as being able to return a product in a high street shop even if it was bought online. The article talks about the consistency problem in the context of sales management software. However, exactly the same thing happens in a DW context, so the argument translates easily.


*The term "plausible" is used to distinguish this notion from "consistent" while still relating the two.
