Tuesday, June 14, 2011

CAP Equivalent for Analytics?

Incidentally, I've bumped into Camuel Gilyadov's blog post titled CAP Equivalent for Analytics. In analogy to the CAP theorem, he argues, there is a similar trade-off between the following four dimensions in an analytic processing environment, i.e. not all four of them can be achieved together; at least one of them needs to be compromised:
  • sophistication: in simple terms, this refers to the complexity of SQL statements needed for the analysis, e.g. complex joins, multiple sorts etc.
  • volume: this refers to data volume involved in the analysis.
  • latency: here, he means the combination of time to load and transform data (ideally: 0) + query processing time (ideally: sub-second).
  • costs: actually, this is meant to be the cost of hardware and software, but I'd add that those costs are a symptom of hardware and/or software architecture complexity.
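
To make the trade-off a bit more tangible, here is a minimal Python sketch (all names and the example profile are hypothetical, not taken from Camuel's post) that treats the four dimensions as flags of which at most three can realistically be set at the same time:

```python
# Hypothetical illustration of the trade-off described above: a set-up that claims
# all four dimensions at once should be treated as suspect.
from dataclasses import dataclass

@dataclass
class AnalyticsProfile:
    sophistication: bool  # complex SQL: joins, multiple sorts, ...
    volume: bool          # large data volumes
    low_latency: bool     # near-zero load/transform time plus sub-second queries
    low_cost: bool        # modest hardware and software spend

    def claims_all_four(self) -> bool:
        """True if no dimension is compromised, which, following the argument
        above, is not achievable in practice."""
        return all((self.sophistication, self.volume, self.low_latency, self.low_cost))

# Example: a classic data warehouse handles sophisticated queries on large volumes
# at acceptable cost, but compromises on latency (batch loads, long-running queries).
classic_dw = AnalyticsProfile(sophistication=True, volume=True,
                              low_latency=False, low_cost=True)
print(classic_dw.claims_all_four())  # False: one dimension is already given up
```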

Tuesday, April 12, 2011

NoSQL Options in Analytics and Data Warehousing

(Here, data warehousing serves as a guiding example for a generic business application)

Recent years have seen many initiatives to relax ACID properties in order to translate the gained freedom into other benefits like better performance, scalability or availability. Typically, such approaches result in trade-offs like the ones manifested in the CAP theorem. The latter provides a systematic and theoretically proven way to look at one particular trade-off: mostly, consistency is sacrificed in order to preserve availability and partition tolerance.

These are purely technical properties that are generic and applicable to all applications for which unavailability and (network) partitions constitute a huge (economic) risk. However, an application (that runs on top of a high-volume or large-scale DBMS) itself frequently provides a number of opportunities to relax the reliance on ACID. One particular example is the software that manages tables, data flows, processes, transformations etc. in a data warehouse. One major goal of such software is to expose data that originates from multiple source systems – each of which is presumably consistent in itself – in a way that makes sense to the end user. Here, "making sense" means (a) that the data is harmonized (e.g. by transformations, cleansing etc.) but also (b) that it is plausible*.

What does the latter mean? Let’s consider the following example: typically, uploads from the various source systems are scheduled independently of each other. Frequently, however, scenarios require the data from all relevant systems to be completely uploaded in order to provide a consistent (plausible) view. A typical example: the total costs of a business process (e.g. a sales process) are only complete (i.e. "consistent" or "plausible") once the costs of all respective sub-processes (e.g. order + delivery + billing) have been uploaded to the data warehouse. A possible consequence of such a situation is that costs overlap or are only partially uploaded. This translates into a non-plausible effect for the end user in the sense that – depending on the moment when he looks at the data – varying amounts for the total costs appear. Such a result is technically correct (i.e. consistent) but not plausible to the end user.
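
To illustrate the effect, here is a toy Python sketch (numbers, times and load names are made up) showing how the visible total depends on the moment of the query, even though each individual load is consistent in itself:

```python
# Hypothetical sub-process loads arriving at different times during the day.
loads = [
    ("order",    100.0, "08:00"),   # cost uploaded by the order system at 08:00
    ("delivery",  40.0, "09:30"),   # cost uploaded by the delivery system at 09:30
    ("billing",   15.0, "11:00"),   # cost uploaded by the billing system at 11:00
]

def total_cost_visible_at(query_time: str) -> float:
    """Sum of all sub-process costs that have already been loaded at query_time."""
    return sum(cost for _, cost, loaded_at in loads if loaded_at <= query_time)

for t in ("08:15", "10:00", "12:00"):
    print(t, total_cost_visible_at(t))
# 08:15 -> 100.0, 10:00 -> 140.0, 12:00 -> 155.0: each value is technically correct,
# but only the last one is "plausible" as the total cost of the sales process.
```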

Frequently, such effects are tackled by implementing plausibility gates that allow data to proceed to the next data layer of a data warehouse only once all the other related data has arrived. In other words: plausibility in this context is a certain form of consistency, namely one that, for instance, harmonizes consistency across multiple (source) systems. A plausibility gate is then a kind of managed COMMIT that achieves a plausible state of the data in the next data layer. It is a perfect example of a property or fact that exists within a business application and that can be exploited to manage consistency differently and in a more performance-optimal way. For example: individual bulk loads can be considered part of a wider "(load or data warehouse) transaction" (which comprises all related loads) and do not have to be considered in isolation. This is clearly an opportunity to relax some constraints. The source of this relief is to be found in the business process and its sub-processes that underlie the various loads.
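
As a rough illustration, the following Python sketch (all names are hypothetical, not an actual product API) models such a plausibility gate as a managed COMMIT over a set of related bulk loads:

```python
# A minimal plausibility gate: data is released to the next layer only once every
# related bulk load of the wider "data warehouse transaction" has arrived.
class PlausibilityGate:
    def __init__(self, required_loads: set[str]):
        self.required_loads = required_loads   # e.g. {"order", "delivery", "billing"}
        self.arrived: set[str] = set()

    def register_load(self, load_name: str) -> None:
        """Record that one of the related bulk loads has completed."""
        self.arrived.add(load_name)

    def may_commit(self) -> bool:
        """The managed COMMIT: only release data to the next layer when the whole
        set of related loads is complete, i.e. the state is plausible."""
        return self.required_loads <= self.arrived

gate = PlausibilityGate({"order", "delivery", "billing"})
gate.register_load("order")
gate.register_load("delivery")
print(gate.may_commit())        # False: billing data is still missing
gate.register_load("billing")
print(gate.may_commit())        # True: the next layer now sees a plausible total
```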

In the context of SAP's In-Memory Appliance (SAP HANA), SAP is currently investigating such opportunities within its rich suite of business applications. Leveraging such opportunities is expected to be one of the major sources for scaling and for increasing performance beyond the more generic, technology-based means like main memory, multi-core (parallelism) and columnar data structures. We suggest looking into options within the classic world of business applications in order to mimic what has been successfully implemented in non-classic applications, especially in the Internet-scale area where NoSQL platforms like Hadoop have been successfully adopted.

PS (27 May 2011): By coincidence, I've come across this article, which describes the consistency problem within a sales process. One example is that a customer expects to be treated equally and independently of which sales channel he/she has used, e.g. being able to return a product in a high street shop even if it has been bought online. The article talks about the consistency problem in the context of sales management software. However, this is exactly what happens in a DW context too and can therefore be easily translated.


*The term plausible is used in order to distinguish from but also to relate to consistent.

Friday, April 8, 2011

Is Data like Timber?

For quite some time, I've been playing with the idea of comparing data to some raw material (like timber) and information to some product made from that raw material (like furniture). This helps to suggest (a) that data and information are related but not the same thing and (b) that there is processing in between, similar to converting wood into furniture. With this blog post, I'd like to throw this comparison out into the public, hopefully to trigger a good discussion that either identifies weaknesses of the comparison or develops the idea even further.

So let's picture the process on how a tree becomes a cupboard, a book shelf or a table:

  1. Trees grow in the forest.
  2. A tree is cut and the log is transported to a factory for further processing.
  3. At the factory, it is stored in some place.
  4. It is subsequently processed into boards. Various tools like saws, presses etc. are used in this context.
  5. The boards are frequently taken to yet another factory that applies various processing steps to create the furniture. Depending on the type of furniture (table, chair, cupboard, ...), a larger or smaller number of more or less complex steps is necessary. Additional materials like glass, screws, nails, handles, metal joints, paint, ... are added. Processing steps include cutting, pressing, painting, drilling, ...

Now when you consider what happens to data before it becomes useful information displayed in a pivot table or a chart then you can identify similar steps:

  1. Data gets created by some business process, e.g. a customer orders some product.
  2. For analysis, the data is brought to some central place for further processing in a calculation engine. This place can be part of a data warehouse or of an on-the-fly infrastructure, e.g. via a federated approach that retrieves the data only when it is needed.
  3. At this central place, the data is stored, e.g. in persistent DB tables or a cache, where it can potentially "meet" data from other sources.
  4. Data is reformatted, harmonised, cleansed, ... using data quality tools, data transformation tools or plain SQL. Simply consider the various formats for a date like 4/5/2011, 5 Apr 2011, 5.4.11, 20110405, ... (see the sketch after this list).
  5. Data is enriched and combined with data from other sources, e.g. the click stream of your web server combined with the user master data table. Only in this combination can you, for instance, tell how many young, middle-aged or old people look at your web site. In the end, the data has become useful information.
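
As a small illustration of steps 4 and 5, here is a Python sketch (the date formats and sample data are made up, not taken from a real system) that first harmonises a few of the date formats mentioned above and then combines a click stream with user master data:

```python
from datetime import datetime
from collections import Counter

def normalize_date(raw: str) -> str:
    """Try a few assumed date formats and return ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%m/%d/%Y", "%d %b %Y", "%d.%m.%y", "%Y%m%d"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unknown date format: {raw}")

print(normalize_date("4/5/2011"), normalize_date("5 Apr 2011"),
      normalize_date("5.4.11"), normalize_date("20110405"))  # all become 2011-04-05

# Enrichment: combine the click stream with the user master data on user_id,
# so that page views can be broken down by age group.
clickstream = [("u1", "/home"), ("u2", "/shop"), ("u1", "/shop")]
user_master = {"u1": "young", "u2": "middle-aged"}

views_by_age_group = Counter(user_master[user_id] for user_id, _page in clickstream)
print(views_by_age_group)  # Counter({'young': 2, 'middle-aged': 1})
```
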
Hopefully, the similarity between the two processing environments has become apparent. In the end, steps 1 to 5 describe a layered, scalable architecture (LSA). In one situation or another, steps will be much simpler or can even be omitted, depending on the type of furniture you want to produce: a book shelf needs less processing than a sophisticated cupboard.

I guess one can now start to play with the analogies: operational reporting, i.e. reporting on data from one source, e.g. one single process ("Which orders have been submitted today?"), is probably tantamount to producing boards or shelves. The log gets cut by a saw, maybe pressed, and that's almost it. One can imagine that this is even done in real time, i.e. directly after the tree has been cut. In contrast, producing a sophisticated analysis (i.e. a cupboard) takes more processing steps and, specifically, involves more data from outside (i.e. materials like screws, paint, joints, glass, ...). Similarly, one could spin the idea further to find analogies for data warehouses, data mining, dashboards, data marts etc.