It's difficult to pay for good information these days

Stephen Few recently published a blog post that takes Forrester to task over the value of its publications on the topic of visual business intelligence, and by extension on all topics that Forrester weighs in on. Meanwhile, over on ZDNet, we've got Ed Bott asking why the IT industry continues to listen to Gartner.

I'm not particularly comfortable with the personal tone that Stephen Few takes in some of his diatribes, but both of these posts are fundamentally correct that the big IT analyst firms are strikingly bad at predicting the future and at evaluating the current state of the market. I'll venture that most people think these are exactly the two skills that analyst firms are selling in their reports, and so it is rather surprising that analyst firms like Forrester and Gartner are no better (and often worse) than the rest of us in these areas.

Of course, this isn't actually what these firms are doing with these reports. As far as I can tell, there are three main customers of large IT analyst firm reports:

  1. Large enterprises (i.e. enterprise software customers)
  2. Consulting firms
  3. Vendors

Notably missing from the list are experts in the fields that these firms are covering. Yes, some of these experts work at consulting companies, vendors, and customers, but they make up a surprisingly small percentage of the employees of these companies. I don't think these are the people buying the reports.

My take: Software customers buy these reports to justify buying decisions. This is the 21st century version of "no one gets fired for buying IBM", but now it reads "no one gets fired for buying a leader on the Gartner Magic Quadrant". Consulting firms buy these reports so that they appear knowledgeable and know what products they can recommend to their clients who are reading the same reports. Vendors buy distribution rights to the reports that list them as industry leaders so they can use the reports to reassure their nervous customers during the extensive sales process.

The emperor has no clothes and everyone already knows it, including the emperor. It's just convenient for everyone to keep their mouths shut about the topic. Stephen Few and Ed Bott are hardly the naive child in this scenario. They have not suddenly realized that the emperor has no clothes, but pointing out this fact every once in a while is a good way to advertise that they are selling something different.

Meanwhile, IT analyst firms churn out reports that are designed to be incredibly conservative, informed by research and sales processes that inherently support the status quo. Of course, they pretend to be cutting edge with an eye on disruptive innovation. They've got to, because the idea of disruptive innovation has become the status quo. But in reality it is almost impossible for one of these reports to recognize and recommend a disruptive technology because the very methodology of the reporting process precludes the analysis of truly disruptive vendors.

It's a sad situation, especially as I know some really bright minds work for these firms. These are people who do understand the industry and do have a feeling for disruptions and innovations. I'm quite glad that Gartner and Forrester now have many of their analysts blogging on these topics, because this is a platform where these people can actually give us a more honest idea of what's going on. But it is telling that when independent analysts join these firms, their blogging tends to drop off as more of their time is plowed into the production of reports and client inquiries.

As it is, I now primarily follow independent analysts (like Curt Monash or Horace Dediu) and commentators for informative reports on the BI, mobile, and data spaces. And I end up doing a lot of my own research. It seems to me that it's unfortunately difficult to pay for good information these days, so you're often better off not paying.

Toward an analysis of datawarehouse and business intelligence challenges - part 2

(This post is a bit of a blast from the past. It was originally published in 2010 and was lost during a migration in 2011.)

This is the second half of an analysis, and with my first post on the topic constitutes my first swipe at listing the current fundamental challenges of the datawarehousing and business intelligence fields. The list is in no particular order and will surely change in the future. It is conceived as the beginning of a framework from which to evaluate new or maturing technologies and architectures from the perspective of applicability to the field.

Aggregating silo-ed data sources

Silos silos silos. Anyone trying to do data analysis has run into this problem: the data exists, but we can't get at it. The technical aspects of this challenge are many (bandwidth, interfaces, and ETL), but it's worth noting that they are usually dwarfed by the cultural and organizational obstacles (default against sharing, departmental rivalries), many of which are in place for good reason (security and permissions concerns, privacy laws).

Representing data in a meaningful way

Historically this feels like one of the least-addressed challenges, but we are finally seeing some serious attention paid to this problem. Challenges in representation of data range from visualization (and the related topic of responsible visualization - as visualization is too often untruthful), to analytical views and tools, through search and guided data exploration.

As things stand, the data in datawarehouses and business intelligence datamarts is too often opaque and misunderstood by most users. Even the most impressive and advanced visualizations and analysis tools (Gapminder, BusinessObjects Explorer, and Qlikview, for example) are still highly guided constructs that are often only applicable to predetermined datasets. We have come a long way (finally) over the last decade, but we have a long way yet to go.

Representing reporting structures

Reporting structures are now fairly well understood, but representing them efficiently in our datawarehouses or BI tools remains a challenge. Some examples of such structures: reporting hierarchies, time-dependency, calculated measures, and derived or dependent characteristics. Challenges revolve around rollup and calculation performance, reorganization due to reporting structure changes, and accessibility to potential users.
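To make the rollup challenge concrete, here is a minimal sketch in Python (the cost-center hierarchy and figures are hypothetical) of aggregating a leaf-level measure up a parent-child reporting hierarchy:

```python
from collections import defaultdict

# Hypothetical parent-child reporting hierarchy: node -> parent (None = root).
parents = {
    "EMEA": None,
    "DE": "EMEA",
    "FR": "EMEA",
    "Berlin": "DE",
    "Munich": "DE",
}

# Leaf-level fact data: node -> revenue (made-up figures).
facts = {"Berlin": 120, "Munich": 80, "FR": 50}

def rollup(parents, facts):
    """Accumulate each fact value into the node itself and every ancestor."""
    totals = defaultdict(float)
    for node, value in facts.items():
        while node is not None:
            totals[node] += value
            node = parents[node]
    return dict(totals)

totals = rollup(parents, facts)
# Each fact is walked all the way to the root, so cost grows with
# hierarchy depth -- and any change to the parent map invalidates
# every precomputed total along the affected path.
```

Note how every fact is pushed through the full depth of the hierarchy; precomputing these totals is what makes queries fast, and it is also exactly what a reporting-structure reorganization invalidates.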

Performance

Traditionally this is the "big one" and it is still very much an unsolved problem. Bound by the CAP tradeoff, we are more or less forced to give up one of consistency, availability, or partition-tolerance in order to improve performance under constant resources. Two approaches prevail: architectures that give up one or more of the three in exchange for performance, and architectures that attempt to better optimize for the problem-space in order to improve performance while maintaining all three CAP axes. Both are perfectly legitimate approaches, but it will be important to recognize which architectural approach is being pursued in any given product or technology. As a wise person once said, "there is no such thing as a free lunch".

Further complicating matters, there are multiple performance aspects of datawarehouse and business intelligence applications, and we need to be clear which ones we attempt to optimize for. These aspects include query performance (keeping in mind the random access vs. batch/bulk/scan access difference), data loading (ETL) and reorganization, and (in some systems) writeback or changing of data.

Security

Security models pose more of a management problem than a technical problem for datawarehouse and BI applications. Nonetheless, I think they're worth mentioning as a core challenge to keep in mind, just in case someone comes up with a way to make reasoning about security in analytical-processing-oriented datasets less painful.

Data loading

Last but certainly not least, data loading is a perennial headache in datawarehouse and BI systems. The three basic types of data loading (batch, real-time/streaming, and write-back/input) all to some extent conflict with each other. Add to that the complexity of managing a profusion of delta mechanisms (many of which exist for good reason, others of which exist because of careless design) and different interface formats and we've got ourselves a real party. Standardization of interfaces and design practices are the key touchstones of conquering this challenge, but as with many of these challenges, this is more of a human problem than a technical problem.

Conclusion - technical vs. design challenges

If we take one thing away from this enumeration of the challenges of the datawarehouse and business intelligence spaces, I hope it is the fact that most of these challenges are more human in nature than they are technical. They tend to derive from the difficulty in making tradeoff decisions, standardizing interfaces and architectures, identifying and focusing on the problem space, and understanding how people may actually use these systems to greatest effect. Because of this, these challenges are often at least as susceptible to design solutions as they are to pure technical solutions. There is a tendency in the industry to focus on technical answers to these challenges over design answers, perhaps because technical solutions are often more impressive and in some sense more tangible. I think that's unfortunate.

Chin scratcher in SAP BusinessObjects Mobile

Here's an interesting one.

[Screenshot: a bar chart in SAP BusinessObjects Mobile with a slider control beneath it]

What is the point of the slider control at the bottom of this chart? This type of control is only useful for time series or similar arrangements, but one would almost always use a line chart, not a bar chart, to display this type of information. Yet it appears to be the default configuration for bar charts in SAP BusinessObjects Mobile and it is used as seen above in the demo application that SAP provides.

I'm stumped.

Toward an analysis of datawarehouse and business intelligence challenges - part 1

(This post is a bit of a blast from the past. It was originally published in 2010 and was lost during a migration in 2011.)

Since I work in and around datawarehousing and business intelligence, I've developed notes and thoughts over the years on the key challenges in these areas. New technologies and architectural approaches are drastically changing the landscape of the field and can help to address some of these challenges, but enterprise software vendors and customers are often not aware of new approaches or their applicability to the classic problems of the field, which continue to persist.

I'm starting to compile this list publicly, in what I hope will be a more-or-less living document, because I will start using it to evaluate the applicability of newly maturing technologies (in-memory, non-relational or NoSQL databases, etc.) and architectures (map-reduce, streaming datawarehousing, etc.) to these old problems. This list is a survey, not an in-depth analysis of the problems. I may provide more in-depth analyses if it seems relevant, but I will more likely look for and point to references where they are available.

This is about half of my initial list and is in no particular order. I'll post the second half of the list shortly.

Data volume

This is a classic datawarehousing problem, often addressed through data modeling, aggregation, or compression. Even though it is one of the oldest problems of the field, it is by no means solved, or even optimally addressed. Enterprises continue to struggle with the cost and technical feasibility of scaling up their datawarehouses, often due to limitations of the underlying database technology.
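To illustrate the aggregation lever with a toy sketch (in Python, with made-up line items, not any particular warehouse's design): pre-aggregating facts to the granularity that queries actually need shrinks the data those queries must scan.

```python
from collections import defaultdict

# Hypothetical line-item facts: (date, region, amount).
line_items = [
    ("2010-01-01", "EMEA", 10.0),
    ("2010-01-01", "EMEA", 5.0),
    ("2010-01-01", "APAC", 7.0),
    ("2010-01-02", "EMEA", 3.0),
]

# Pre-aggregate to (date, region) granularity: queries at that grain
# now read one row per combination instead of every line item.
aggregate = defaultdict(float)
for date, region, amount in line_items:
    aggregate[(date, region)] += amount
```

The tradeoff, of course, is that the aggregate only answers questions at its own grain or coarser; anything finer still needs the detail data.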

Data quality

We may be able to procure and store all of the necessary data, but that is no guarantee that the data is correct. This challenge has more to do with data being wrong than with data being misunderstood or semantically misaligned, though this is a related problem. Data quality issues can arise for many reasons including incorrect data at the point of entry, incomplete data, duplicate data, or data that becomes incorrect because of an invalid transformation.

Data consistency

Even when data is correctly stored in a datawarehouse, it may become temporarily inconsistent under certain operations. For example, when deleting or loading data, there may be a period of time when queries can access part of the data being loaded or deleted, but not all of it. This can be thought of as an inconsistent state, and while most datawarehousing tools ensure consistency in some manner, this is an area that may sometimes be traded for better handling of another challenging area. The classic tradeoff is between consistency, loading performance, and query performance.
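One common way to avoid exposing that half-loaded state is to load into a staging table and then swap it in atomically. Here is a minimal sketch using Python's built-in sqlite3 module (the table and figures are hypothetical; real warehouses use heavier machinery such as partition exchange, but the shape of the idea is the same):

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode so we can
# control the transaction boundaries explicitly.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
con.execute("INSERT INTO sales VALUES ('EMEA', 100.0)")

# Load the new snapshot into a staging table; readers still see the
# old data while this (possibly long) load is in progress.
con.execute("CREATE TABLE sales_staging (region TEXT, revenue REAL)")
con.executemany("INSERT INTO sales_staging VALUES (?, ?)",
                [("EMEA", 150.0), ("APAC", 75.0)])

# Swap the tables in one transaction: queries see either the old
# snapshot or the new one, never a partially loaded mix.
con.execute("BEGIN")
con.execute("ALTER TABLE sales RENAME TO sales_old")
con.execute("ALTER TABLE sales_staging RENAME TO sales")
con.execute("DROP TABLE sales_old")
con.execute("COMMIT")

rows = con.execute("SELECT region, revenue FROM sales ORDER BY region").fetchall()
```

The price of the pattern is the extra storage and load time for the second copy, which is the consistency-versus-loading-performance tradeoff in miniature.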

Semantic integration

An oft-overlooked but extremely important concept, semantic integration challenges come in two flavors: homonyms, meaning data that has the same name but different meanings (your "Revenue" may not be the same as my "Revenue"), and synonyms, meaning data that has the same meaning but different names.
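A sketch of what resolving this looks like in practice (in Python, with hypothetical source systems and field names): each source's vocabulary is mapped onto one canonical vocabulary before data is combined.

```python
# Hypothetical mapping from (source system, field name) to a canonical
# term. "Revenue" is a homonym here: the billing system reports it
# gross, the ERP reports it net. "Turnover" in the ERP is a synonym
# for the billing system's gross "Revenue".
FIELD_MAP = {
    ("billing", "Revenue"): "gross_revenue",
    ("erp", "Revenue"): "net_revenue",
    ("erp", "Turnover"): "gross_revenue",
}

def canonicalize(source, record):
    """Rename a record's fields into the shared canonical vocabulary."""
    return {FIELD_MAP[(source, field)]: value
            for field, value in record.items()}

a = canonicalize("billing", {"Revenue": 100})
b = canonicalize("erp", {"Revenue": 90, "Turnover": 100})
```

The mapping table itself is the hard part, of course; agreeing on it is an organizational negotiation, not a coding exercise.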

Historical data

Dealing with historical data is a challenge that could be subsumed under other challenges. Usually the problems here are mostly issues of handling volume, master data management (changing dimensions), and semantic integration. However, historical data brings some unique angles to these challenges, including possible relaxation of requirements around performance and detail, as well as new legal, audit, and governance requirements around retaining or discarding specific data-sets.

Unstructured data

Datawarehouses have always focused on structured data, primarily because of a complete lack of tools for handling unstructured data rather than because of a philosophical view that unstructured data does not belong in a datawarehouse. This is not to say that such a view does not exist, but rather that it derives from an inability to execute rather than from any underlying principle, and so should be ignored in light of new tools.

Unstructured data brings with it design constraints and requirements that do not normally appear in datawarehousing discussions. These include a lack of design-time information about dimensionality, the existence of non-numeric "key figures" (text- or image-based data, for example), document-oriented data, and the need for full-text search. Additionally, the challenge of derived dimensions and measures is strongly related to unstructured data, as these are key tools for allowing us to derive structured reporting and analysis from unstructured data-sets.