Toward an analysis of datawarehouse and business intelligence challenges - part 2

(This post is a bit of a blast from the past. It was originally published in 2010 and was lost during a migration in 2011.)

This is the second half of an analysis, and with my first post on the topic constitutes my first swipe at listing the current fundamental challenges of the datawarehousing and business intelligence fields. The list is in no particular order and will surely change in the future. It is conceived as the beginning of a framework from which to evaluate new or maturing technologies and architectures from the perspective of applicability to the field.

Aggregating silo-ed data sources

Silos silos silos. Anyone trying to do data analysis has run into this problem: the data exists, but we can't get at it. The technical aspects of this challenge are many (bandwidth, interfaces, and ETL), but it's worth noting that they are usually dwarfed by the cultural and organizational obstacles (default against sharing, departmental rivalries), many of which are in place for good reason (security and permissions concerns, privacy laws).

Representing data in a meaningful way

Historically this feels like one of the least-addressed challenges, but we are finally seeing some serious attention paid to this problem. Challenges in representation of data range from visualization (and the related topic of responsible visualization, as visualization is too often untruthful), to analytical views and tools, to search and guided data exploration.

As it stands, the data in datawarehouses and business intelligence datamarts is too often opaque and misunderstood by most users. Even the most impressive and advanced visualizations and analysis tools (Gapminder, BusinessObjects Explorer, and Qlikview, for example) are still highly guided constructs that are often only applicable to predetermined datasets. We have come a long way (finally) over the last decade, but we have a long way yet to go.

Representing reporting structures

Reporting structures are now fairly well understood, but representing them efficiently in our datawarehouses or BI tools remains a challenge. Some examples of such structures: reporting hierarchies, time-dependency, calculated measures, and derived or dependent characteristics. Challenges revolve around rollup and calculation performance, reorganization due to reporting structure changes, and accessibility to potential users.
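To make the rollup challenge concrete, here is a minimal sketch of accumulating a measure up a parent-child reporting hierarchy. The org structure and revenue numbers are invented for illustration; real tools must also handle time-dependency and hierarchy changes, which this sketch ignores.

```python
# Sketch: rolling up a measure along a parent-child reporting hierarchy.
# The hierarchy and leaf values below are invented for illustration.

# child -> parent; None marks the root
hierarchy = {
    "EMEA": "Global",
    "Americas": "Global",
    "Germany": "EMEA",
    "France": "EMEA",
    "Global": None,
}

# leaf-level measure values (e.g. revenue)
leaf_revenue = {"Germany": 120, "France": 80, "Americas": 200}

def rollup(hierarchy, leaf_values):
    """Accumulate each leaf value into the leaf node and all of its ancestors."""
    totals = {node: 0 for node in hierarchy}
    for node, value in leaf_values.items():
        while node is not None:
            totals[node] += value
            node = hierarchy[node]
    return totals

totals = rollup(hierarchy, leaf_revenue)
print(totals["EMEA"])    # 200
print(totals["Global"])  # 400
```

Even this toy version hints at the performance problem: every leaf value is touched once per ancestor, which is why real datawarehouses precompute aggregates or use specialized hierarchy encodings.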

Performance

Traditionally this is the "big one" and it is still very much an unsolved problem. Bound by the CAP tradeoff, we are more or less forced to give up either consistency, availability, or partition-tolerance in order to improve performance under constant resources. Two approaches prevail: architectures that give up one or more of the three in exchange for performance, and architectures that attempt to better optimize for the problem-space in order to improve performance while maintaining all three CAP axes. Both are perfectly legitimate approaches, but it will be important to recognize which architectural approach is being pursued in any given product or technology. As a wise person once said, "there is no such thing as a free lunch".

Further complicating matters, there are multiple performance aspects of datawarehouse and business intelligence applications, and we need to be clear which ones we attempt to optimize for. These aspects include query performance (keeping in mind the random access vs. batch/bulk/scan access difference), data loading (ETL) and reorganization, and (in some systems) writeback or changing of data.

Security

Security models pose more of a management problem than a technical problem for datawarehouse and BI applications. Nonetheless, I think they're worth mentioning as a core challenge to keep in mind, just in case someone comes up with a way to make reasoning about security in analytical-processing-oriented datasets less painful.

Data loading

Last but certainly not least, data loading is a perennial headache in datawarehouse and BI systems. The three basic types of data loading (batch, real-time/streaming, and write-back/input) all conflict with each other to some extent. Add to that the complexity of managing a profusion of delta mechanisms (many of which exist for good reason, others of which exist because of careless design) and different interface formats, and we've got ourselves a real party. Standardized interfaces and design practices are the key touchstones for conquering this challenge, but as with many of these challenges, this is more of a human problem than a technical problem.
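As an illustration of one common delta mechanism, a timestamp-based delta extraction can be sketched as follows. The source rows and field names are invented for this example; real delta handling also has to cope with deletions and late-arriving changes, which this sketch does not.

```python
# Sketch: a simple timestamp-based delta mechanism for incremental loading.
# Source rows and field names are invented for illustration.

source = [
    {"id": 1, "changed_at": "2010-05-01", "value": 10},
    {"id": 2, "changed_at": "2010-05-03", "value": 20},
]

def extract_delta(rows, last_load):
    """Pull only the rows changed since the previous load ran."""
    return [r for r in rows if r["changed_at"] > last_load]

delta = extract_delta(source, "2010-05-02")
print(len(delta))  # 1
```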

Conclusion - technical vs. design challenges

If we take one thing away from this enumeration of the challenges of the datawarehouse and business intelligence spaces, I hope it is the fact that most of these challenges are more human in nature than they are technical. They tend to derive from the difficulty in making tradeoff decisions, standardizing interfaces and architectures, identifying and focusing on the problem space, and understanding how people may actually use these systems to greatest effect. Because of this, these challenges are often at least as susceptible to design solutions as they are to pure technical solutions. There is a tendency in the industry to focus on technical answers to these challenges over design answers, perhaps because technical solutions are often more impressive and in some sense physical. I think that's unfortunate.

Toward an analysis of datawarehouse and business intelligence challenges - part 1

(This post is a bit of a blast from the past. It was originally published in 2010 and was lost during a migration in 2011.)

Since I work in and around datawarehousing and business intelligence, I've developed notes and thoughts over the years on the key challenges in these areas. New technologies and architectural approaches are drastically changing the landscape of the field and can help to address some of these challenges, but enterprise software vendors and customers are often not aware of new approaches or their applicability to the classic problems of the field, which continue to persist.

I'm starting to compile this list publicly, in what I hope will be a more-or-less living document, because I will start using it to evaluate the applicability of newly maturing technologies (in-memory, non-relational or NoSQL databases, etc.) and architectures (map-reduce, streaming datawarehousing, etc.) to these old problems. This list is a survey, not an in-depth analysis of the problems. I may provide more in-depth analyses if it seems relevant, but I will more likely look for and point to references where they are available.

This is about half of my initial list and is in no particular order. I'll post the second half of the list shortly.

Data volume

This is a classic datawarehousing problem, often addressed through data modeling, aggregation, or compression. Even though it is one of the oldest problems of the field, it is by no means solved, or even optimally addressed. Enterprises continue to struggle with the cost and technical feasibility of scaling up their datawarehouses, often due to limitations of the underlying database technology.
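One of the classic volume-control techniques mentioned above is pre-aggregation: collapsing detail rows to the granularity at which they will actually be queried. A minimal sketch, with invented fact rows:

```python
# Sketch: pre-aggregating detail rows to reduce data volume.
# The fact rows (month, product, quantity) are invented for illustration.
from collections import defaultdict

detail_rows = [
    ("2010-01", "widgets", 10),
    ("2010-01", "widgets", 15),
    ("2010-01", "gadgets", 7),
    ("2010-02", "widgets", 20),
]

def aggregate(rows):
    """Collapse detail rows to one row per (month, product) combination."""
    totals = defaultdict(int)
    for month, product, qty in rows:
        totals[(month, product)] += qty
    return dict(totals)

agg = aggregate(detail_rows)
print(len(detail_rows), "->", len(agg))  # 4 -> 3
```

The tradeoff, of course, is that the aggregated rows can no longer answer questions at the original level of detail.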

Data quality

We may seem to be able to procure and store all of the data necessary, but that is no guarantee that the data is correct. This challenge has more to do with data being wrong than with data being misunderstood or semantically misaligned, though the two are related. Data quality issues can arise for many reasons including incorrect data at the point of entry, incomplete data, duplicate data, or data that becomes incorrect because of an invalid transformation.
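As one small example of the duplicate-data problem, a basic duplicate check over a candidate key can be sketched like this (the records and the choice of "email" as the key are invented for illustration; real matching often has to be fuzzy):

```python
# Sketch: flagging duplicate records on a candidate key, one common
# data-quality check. Records and key choice are invented for illustration.

records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},  # duplicate of record 1 by email
]

def find_duplicates(records, key):
    """Return every record whose key value has already been seen."""
    seen, dupes = set(), []
    for rec in records:
        if rec[key] in seen:
            dupes.append(rec)
        else:
            seen.add(rec[key])
    return dupes

print(find_duplicates(records, "email"))  # the record with id 3
```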

Data consistency

Even when data is correctly stored in a datawarehouse, it may become temporarily inconsistent under certain operations. For example, when deleting or loading data, there may be a period of time when queries can access part of the data being loaded or deleted, but not all of it. This can be thought of as an inconsistent state, and while most datawarehousing tools ensure consistency in some manner, this is an area that may sometimes be traded for better handling of another challenging area. The classic tradeoff is between consistency, loading performance, and query performance.
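One way tools keep queries consistent during a load is to build the new version of the data off to the side and publish it in a single atomic step, so readers never see a half-loaded state. The following is a minimal in-memory sketch of that snapshot-swap idea; the class names and data are invented, and a real system would do this with transactions or versioned storage:

```python
# Sketch: avoiding inconsistent reads during a load via an atomic
# snapshot swap. Class names and rows are invented for illustration.

class Snapshot:
    def __init__(self, rows):
        self.rows = tuple(rows)  # immutable once published

class Warehouse:
    def __init__(self, rows):
        self._current = Snapshot(rows)

    def query(self):
        # Readers always see a complete snapshot, never a partial load.
        return self._current.rows

    def load(self, new_rows):
        # Build the new version off to the side...
        staged = Snapshot(list(self.query()) + list(new_rows))
        # ...then publish it in one reference swap.
        self._current = staged

wh = Warehouse([("2010-01", 100)])
before = wh.query()
wh.load([("2010-02", 150)])
print(len(before), len(wh.query()))  # 1 2
```

Note the tradeoff the text describes: the staged copy costs memory and load time, which is exactly the consistency-versus-loading-performance tension.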

Semantic integration

An oft-overlooked but extremely important concept, semantic integration challenges come in two flavors. Homonyms are data that have the same name but different meanings (your "Revenue" may not be the same as my "Revenue"). Synonyms are data that have the same meaning but different names.
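A common way to handle both flavors is a mapping layer that translates source-system field names to canonical warehouse names. A minimal sketch, with invented system and field names:

```python
# Sketch: a mapping layer reconciling synonyms and homonyms across
# source systems. Systems and field names are invented for illustration.

# (source system, source field) -> canonical warehouse field
field_map = {
    ("crm", "turnover"): "revenue",       # synonym: same meaning, different name
    ("erp", "revenue"): "revenue",
    ("crm", "revenue"): "gross_revenue",  # homonym: same name, different meaning
}

def canonicalize(system, record):
    """Rename a record's fields to canonical names; unknown fields pass through."""
    return {field_map.get((system, k), k): v for k, v in record.items()}

print(canonicalize("crm", {"turnover": 100}))  # {'revenue': 100}
print(canonicalize("crm", {"revenue": 100}))   # {'gross_revenue': 100}
```

The hard part, naturally, is not the lookup but agreeing on the map in the first place.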

Historical data

Dealing with historical data is a challenge that could be subsumed under other challenges. Usually the problems here are mostly issues of handling volume, master data management (changing dimensions), and semantic integration. However, historical data brings some unique angles to these challenges, including possible relaxation of requirements around performance and detail, as well as new legal, audit, and governance requirements around retaining or discarding specific data-sets.

Unstructured data

Datawarehouses have always focused on structured data, primarily because of a complete lack of tools for handling unstructured data rather than because of a philosophical view that unstructured data does not belong in a datawarehouse. This is not to say that the philosophical view does not exist, but rather that the philosophical view derives from an inability to execute rather than any underlying principle, and so should be ignored in light of new tools.

Unstructured data brings with it design constraints and requirements that do not normally appear in datawarehousing discussions. These include a lack of design-time information about dimensionality, the existence of non-numeric "key figures" (text- or image-based data, for example), document-oriented data, and the need for full-text search. Additionally, the challenge of derived dimensions and measures is strongly related to unstructured data, as these are key tools for allowing us to derive structured reporting and analysis from unstructured data-sets.
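To illustrate the derived-dimension idea, here is a deliberately naive sketch that derives a structured product dimension from free text via keyword matching. The terms and categories are invented; real systems would use entity extraction or classification models rather than a lookup table:

```python
# Sketch: deriving a structured dimension from unstructured text with a
# naive keyword classifier. Terms and categories invented for illustration.
import re

PRODUCT_TERMS = {"widget": "Widgets", "gadget": "Gadgets"}

def derive_product_dimension(text):
    """Return the first product category mentioned in the text, if any."""
    for term, category in PRODUCT_TERMS.items():
        if re.search(r"\b%ss?\b" % term, text, re.IGNORECASE):
            return category
    return "Unclassified"

print(derive_product_dimension("Customer complained about two widgets"))
# Widgets
```

Once derived, the dimension behaves like any other: it can be indexed, rolled up, and joined, which is precisely what makes unstructured data reportable at all.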

Downloads for Developers

The news here isn’t that the 'new king-makers', as Savio put it, look a lot like the old kingmakers: developers. The news is that management may finally be realizing it.

Stephen O'Grady, Redmonk

Developers, developers, [...] developers

- Steve Ballmer

Most software platform companies at least partially get it these days - developers drive adoption and quality among technical groups. The software that has quality developers on its side will look better to both business and technical interest groups than the same software that is dragged down by developer indifference or animosity.

These points can be debated to a certain extent.

There are plenty of non-technical power-centers that drive adoption of software platforms in the enterprise. Preferred-vendor arrangements are common and historically were often negotiated at the CIO or higher level, with little developer involvement. Further, application vendors attempt (often with good success) to sell into business units rather than IT groups.

But in both cases developers still drive quality and adoption. Business units that buy applications directly often find themselves in need of connections to other systems or extensions to the application. This means developers are involved, either via an IT group, or as outside consultants.

Meanwhile, preferred-vendor agreements are constantly undermined and even when they are successful they may very well promote homogeneity and management ease at the expense of long-term quality in people and software. In order to make good software development and vendor management decisions, one must be well aware of the world beyond a single vendor bubble.

In order to bring developers on board with a vendor's offerings, to increase general awareness, and to drive sales, vendors need to get their software into the hands of developers. In the case of open source vendors, this is mostly an issue of getting the word out, as the software itself is only a download away. But in the case of more traditional enterprise vendors this can be a complicated proposition. Most enterprise vendors now provide downloads of some version of most of their platform software. Some vendors provide downloads of much of their application software as well.

Some example download sites:

These downloads are usually made under a fairly restrictive license and are usually not available for all parts of the application or platform software. Because of vendors' business models, providing these downloads is somewhat costly: they must be produced in a format that is not the vendor's standard distribution format. There are also legal costs associated with writing and maintaining the developer licenses that are applied to these downloads.

I believe that vendors have a tendency to see these sorts of downloads as an overhead cost. They are not. They are a key step in driving both sales and developer adoption, which are closely linked.

Here's how:

  • Ability to prototype before purchasing is a key part of the software selection process for responsible companies.
  • Today's developers will guide tomorrow's purchasing decisions.
  • A healthy developer ecosystem is a necessary condition for a strong third-party application ecosystem.
  • A skilled, and preferably large, pool of developers is necessary for good project success rates.

In his article "The CIO is the last to know", Billy Marshall talks about the CIO of a financial services company who is surprised to find that his operations people are running Red Hat Linux. This CIO was handed a decision via bottom-up fiat. It is a story that is played out again and again in the enterprise space.

The point is not that CIOs aren't doing their jobs. It's that the decision is inevitably influenced from a different level: the level of those actually carrying out development and operations. Maybe these people aren't actually making the purchasing decisions, but they talk to the people who are. And if someone makes a purchasing decision that development and operations disagree with or are unable to execute, that person is going to hear it. And they'll probably feel it when their group's productivity falls off a cliff.

CIOs either are or should be listening to their developers' opinions. It would be wise for enterprise vendors to divert some sales attention into making sure that those developers have good opinions of their software.

The first step is getting that software into developers' hands quickly and with a minimum of developer effort.

SAP's HANA and "the Overall Confusion"

I threw together a very long response to a very long question on the SCN forums, regarding SAP's HANA application and its impact on business intelligence and datawarehousing activities. The original thread is here and I'm sure it will continue to grow. But since my response was pretty thorough and contains a ton of relevant links, I thought I would reformat it and post it here as well. In order to get a good overview of the HANA situation, I strongly recommend that anyone interested check out the following blogs and articles by several people, myself included:

Some of these blogs are using out of date terminology, which is hard to avoid since SAP seems to change its product names every 6 months. But hopefully if you read them they will give you some insight into the overall situation unfolding around HANA. With regards to DW/BI and HANA, these blogs address many of those issues as well. Now, to try answering the questions:

1. Does SAP HANA replace BI?

It's worth noting that HANA is actually a bundle of a few technologies on a specific hardware platform. It includes ETL (Sybase Replication Server and BusinessObjects Data Services), database and database-level modeling tools (ICE, or whatever it's called today), and reporting interfaces (SQL, MDX, and possibly bundled BusinessObjects BI reporting tools). So, in the sense that your question is "does anything change as far as needing to do ETL, modeling, and reporting work to develop BI solutions?", then the answer is no. If you are asking about SAP's overall strategy regarding BW, then this is open to change and I think the blogs above will give you some answers. The short answer is that I see SAP supporting both the scenario of using BW as a DW toolkit (running on top of BWA or HANA) as well as the scenario of using loosely coupled tools (HANA alone, or the database of your choice with BusinessObjects tools) for the foreseeable future. At least I hope this is the case, as I think it would be a mistake to do otherwise.

2. Will SAP continue 5-10 years down the road to support "Traditional BI"?

I hope so. If you read my last blog listed above you will see that HANA actually solves none of the traditional BI problems, and addresses only a few of them. So we still need "traditional" (read "good old hard work") approaches to address these problems.

3. What does this mean for our RDBMS, meaning Oracle?

Very interesting question. For a long time, SAP has supported competitive products to Oracle offerings. In my view, this was to give SAP and its customers options other than the major database vendors, and to give itself an out in the event that contract negotiations with a major vendor went south. So in a sense, HANA can be seen as maintaining this alternative offering.

Of course, SAP says HANA is more than that, and I think they are right. Analytic DBMSes have been relatively slow catching on, and as SAP's business slants more and more towards BI, the fact is that the continued use of traditional RDBMSes in BI and DW contexts has done a lot of damage by making it difficult to achieve good performance. It's a lot easier to sell fast reports than slow reports :-) So that is another driver.

Personally, I don't agree with SAP's rhetoric about HANA being revolutionary or changing the industry. The technologies and approaches used in the ICE are not new, as far as I have seen. As far as changing the industry from a performance or TCO perspective, I'm reserving judgement on that until SAP releases some repeatable benchmarks against competing products. I doubt that HANA will significantly outperform competitive columnar in-memory databases like Exasol and ParAccel.

If you are Oracle, you have a rejuvenated, and perhaps slightly more frightening, competitor. I don't think anyone really thought that MaxDB was a danger to Oracle, but HANA holds more potential as a competitor to Exadata. Licensing discussions could get interesting.

4. Is HANA going to be adopted and implemented more quickly on the ECC side or the BI side?

Everything I have seen has indicated that SAP will be driving adoption in BI/Analytic scenarios first and then in the ECC/Business Suite scenario once everyone is satisfied with the stability of the solution. Keep in mind, the first version of HANA is still in ramp-up. SAP is usually very conservative in certifying databases to run Business Suite applications.