Notes on web scraping

Well hello there. You may remember me from blogs written well over a year ago! Glad to be back.

Yesterday I tweeted about web scraping, mentioning that it's always a bad idea. What I mean by "web scraping" is the practice of using automated tools to harvest data from a website. Usually this is done to gather data for research or for use in a product.

That was, I'm sure, a bit too strong. After all, universals are always incorrect. But I think it's basically the right approach, and I'd like to explain a little bit about why that is.

The problems with scraping

Web scraping is problematic for several reasons, among them:

  • Legality, in that scraping is usually prohibited by terms of service, and the use of scraped content is usually a copyright violation barring fair use exceptions. So in scraping you are risking running afoul of wonderful laws like (in the US) the CFAA and DMCA. If you've heard of Aaron Swartz, you've heard of someone who got slammed with the CFAA for web scraping.
  • Ethics, in that regardless of law it can be unethical to use content that doesn't belong to you, and in scraping you risk bypassing restrictions on content use that may have been put in place intentionally and which you may not even be aware of. Remember, the fact that information is technically accessible, or even public, doesn't mean that it is not sensitive. Example: Your whereabouts on a public street are technically public information, at least in the US. Your neighbor waving at you when you leave your house is great. Someone sitting outside your house 24/7 and tweeting whenever you come and go is grounds for a restraining order.
  • Technical and financial concerns such as denial of service. You don't know what kind of capacity the website you are scraping has available, nor how they are paying for it. It could well be that every request you send costs the website incremental money, or that their server will simply crash due to your scraping. You can't know this ahead of time if you haven't consulted with the person or organization running the website. You can be careful, and you can even respect technical indicators like robots.txt files (see the sketch after this list), but you still won't really know unless you coordinate.
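
Respecting robots.txt is the bare minimum of carefulness. As a minimal sketch (assuming Node 18+ for the built-in fetch; the parsing is deliberately naive and ignores wildcards and Allow rules), a check might look like this - and note that passing it still tells you nothing about terms of service or server capacity:

// Naive robots.txt check: does any "Disallow" rule for User-agent "*"
// cover the path we want to fetch?
async function isDisallowed(origin, path) {
  const res = await fetch(origin + '/robots.txt');
  if (!res.ok) return false; // no robots.txt, so no technical signal either way
  const rules = [];
  let appliesToUs = false;
  for (const line of (await res.text()).split('\n')) {
    const idx = line.indexOf(':');
    if (idx < 0) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') appliesToUs = (value === '*');
    if (field === 'disallow' && appliesToUs && value) rules.push(value);
  }
  return rules.some(function(prefix) { return path.startsWith(prefix); });
}

// Usage: isDisallowed('', '/search').then(console.log);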

Instructors: Keep in mind that when you indicate to your students that it is OK to web scrape to gather data for their projects, you are opening them up to real liability. Encouraging your students to violate the CFAA is no small thing, no matter how bad a law it is.

What are the alternatives?

First, I'd like to remind us all to keep in mind that we are not generally entitled to use resources we don't own in the service of our own projects. Now, there are certainly situations in which using resources without the owner's permission can be justified, and we'll get to that down around step 6 or 7, but all other things being equal, if there is a better option, you should use it.

My recommendation when looking to acquire bulk data for research purposes is to do the following in this order:

  1. Look for bulk download options provided by the content owner. It is very common to see people talk about scraping sites like Wikipedia, IMDB, Github, or StackOverflow. Guess what! No need, as all of these sites provide bulk data download options.
  2. If no bulk download is available, see if there is an API that might provide what you need within the API terms of use.
  3. Check the terms of service or user agreement to see if by some chance it actually explicitly allows automated scraping in some form. If it does, then go ahead, but this is a rare case.
  4. Email the organization or person running the site and ask if it would be possible to arrange access to the data you are interested in.
    1. You may want to do this well ahead of the time you'll need the data so that there is time for negotiation, legal agreements, etc.
    2. Simultaneously, you may want to ask around among your network and advisors as to whether there is a version of the bulk data available already that you may have missed.
  5. If the organization is not willing to allow use of their data, see if there is an alternative data set that could serve the same purpose. Note that this may require changing your research plan. That's fine! Part of building a good research plan is being able to execute it ethically and legally. It's a good skill to develop.
  6. Consult a lawyer. And an ethicist if there is anything conceivably sensitive about the data you are accessing.
  7. If you've determined that the merits of your need outweigh the legal and ethical issues around ownership and permission, and you are willing to accept the potential consequences of being wrong, scrape.


... some people won't agree with the ethical points above. And keep in mind that I'm not a lawyer, nor do I really know anything about the law in this area; I'm laying out a fairly conservative position. If you disagree, and your eyes are open to the possible consequences of being wrong, ranging from a mild slap on the wrist to career suicide and possible jail time, then who am I to stop you?

But if you are an instructor teaching a class, a lead researcher advising your subordinates, or an employee or contractor doing a project for a company, there is more at stake than just yourself. You could well get your students in trouble or subject your company to a damaging lawsuit without realizing that what you are doing is problematic. So think twice and look for alternatives.


Crossfilter has played a pretty big part in my work for the last couple of years. For those who don't know, Crossfilter is a JavaScript library that "supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records" in the web browser. It is designed for use in visualizations and applications that allow interaction with and filtering of multiple dimensions of a single data set.

I've found Crossfilter to be a very powerful tool for building applications that support data interactions that feel 'real-time'. The idea is that <100ms interactions are fast enough to allow us to identify patterns and correlations as the visualization changes in exact coordination with our filtering. I use Crossfilter in several places, but the biggest project is easily Palladio, which is a testbed platform for research activities and a powerful example of the possibilities of browser-based data exploration tools.
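
To make that concrete, here is a minimal sketch of the Crossfilter model (crossfilter 1.x API, with made-up data rather than anything from Palladio): dimensions filter, groups aggregate, and each group observes the filters on every dimension except its own.

var records = [
  { city: 'Oslo',   amount: 10 },
  { city: 'Berlin', amount: 25 },
  { city: 'Oslo',   amount: 5 }
];

var cf = crossfilter(records);
var cityDim = cf.dimension(function(d) { return; });
var amountDim = cf.dimension(function(d) { return d.amount; });

// Count of records per city, kept up to date as filters change.
var countByCity =;

amountDim.filter([0, 20]);      // keep amounts in the range [0, 20)
console.log(countByCity.all()); // [{ key: 'Berlin', value: 0 }, { key: 'Oslo', value: 2 }]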

Faceted browsing in Palladio - driven by Crossfilter and Reductio

Crossfilter consistently trips up people trying to use it for the first time because, I believe, its programming model for aggregations is inconsistent with the model we usually expect. It's for this reason that I started Reductio, a library that helps build up aggregation functions that work correctly and efficiently with Crossfilter's model.

I'm not going to get into all of the details here, but defining aggregations in Crossfilter requires writing functions that incrementally update the aggregation as individual records are added or removed. This is at odds with the standard way of computing aggregations by building them up from scratch that we see in SQL or more typical map-reduce models.*

// A function that can be passed to Array.reduce to sum the array
function(a, b) {
  return a + b;
}

// The equivalent operation in Crossfilter requires 3 functions.
// p is the running total; v is the record being added or removed.
function reduceAdd(p, v) {
  return p + v.number;
}

function reduceRemove(p, v) {
  return p - v.number;
}

function reduceInitial() {
  return 0;
}
It's these aggregation functions that consistently trip people up. The summation case is simple enough because addition is reversible: given the running total and the value to be added or removed, we can compute the new total directly. But what about operations that are not reversible? Take the computation of a maximum. When adding a new record to a maximum aggregation, we just need to check whether the value of the record is larger than the current largest value we've seen (the current maximum).

// A function that can be passed to Array.reduce to return the max
function(a, b) {
  return Math.max(a, b);
}
But if we remove a record with a value equal to the current maximum, we need to know the next-largest value remaining. And if we remove the record with that value, we need to know the one after that, and so on. We have to keep track of all values seen in order to avoid needing to rebuild the entire aggregation from scratch when a record is removed. And yes, this is significantly faster than rebuilding from scratch!

// The equivalent operation in Crossfilter (optimized, with
// error guards). Adapted from Reductio
var bisect = crossfilter.bisect.by(function(d) { return d; }).left;

function reduceAdd(p, v) {
    // Insert the new value into the sorted array of values seen.
    var i = bisect(p.values, v.number, 0, p.values.length);
    p.values.splice(i, 0, v.number);
    p.max = p.values[p.values.length - 1];
    return p;
}

function reduceRemove(p, v) {
    var i = bisect(p.values, v.number, 0, p.values.length);
    p.values.splice(i, 1);
    // Check for undefined.
    if(p.values.length === 0) {
        p.max = undefined;
        return p;
    }
    p.max = p.values[p.values.length - 1];
    return p;
}

function reduceInitial() {
    return { values: [], max: 0 };
}
It's a lot of work to come up with these aggregation functions, make them efficient, and test them thoroughly, and Crossfilter beginners (myself included) consistently struggle with this. Believe it or not, things get even nastier when dealing with things like pivot- or exception-aggregations, when building aggregations where a record counts in more than one aggregation group (like moving averages, or tags), or when trying to compose multiple aggregations (commonly required for visualizations as simple as a scatter-plot). Reductio supports all of these aggregations and a few more, and I hope it will allow a quicker, smoother start when using Crossfilter.
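
For comparison, here is roughly what the max example above looks like with Reductio (a sketch with made-up data; the accessor mirrors the v.number field used above):

var cf = crossfilter([
  { category: 'a', number: 3 },
  { category: 'a', number: 7 },
  { category: 'b', number: 5 }
]);
var categoryDim = cf.dimension(function(d) { return d.category; });
var group =;

// Attach a max aggregation to the group; Reductio generates the
// reduceAdd/reduceRemove/reduceInitial functions for us.
reductio().max(function(d) { return d.number; })(group);

group.all().forEach(function(g) {
  console.log(g.key, g.value.max); // a 7, b 5
});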

* This difference in model is what allows very fast interactions with data and aggregations. If you aren't building an application that requires real-time interactions and are thinking about using Crossfilter simply to do one-time calculations or aggregations on your data, think again: it is probably not the right tool for the job.

Skating to where the puck is

Being an SAP Mentor is an interesting situation. The relationship with SAP usually feels amazingly collaborative, but occasionally uncomfortably adversarial. This is probably a healthy place for a relationship with a massive organization to be. Every organization includes many great people, and interactions with those people are always delightful. Every organization is also, as an organization, predisposed to get the answers to certain questions consistently right and the answers to other sorts of questions consistently wrong.

One such question that SAP has consistently gotten wrong in the past is how to push technology adoption among a massive install base. Historically, it would appear that SAP has tended to try to monetize upgrades to underlying technology in order to recoup development costs or profit from its technology offerings. Netweaver, BW, Visual Composer, add-ons like SSO and Gateway, Personas, Business Client, and HANA* are all technologies that SAP has either underinvested in due to lack of direct revenue potential (BW especially) or has tried to monetize. The result is that widely deployed "free" technology from SAP often stagnates, while advanced technology with an additional cost often sees limited adoption in SAP's install base.

This outcome is terrible for SAP's business. It makes it very difficult for SAP to keep its application offerings competitive in the marketplace, because the application functionality needed to stay competitive often depends on improvements in the underlying technology. But if technology is either widely deployed and under-featured or decently featured but not widely deployed, then applications must target only the under-featured technology stack in order to have a large enough potential install base to justify development costs. In other words, SAP often forces itself to build applications on sub-par technology.

We see this dynamic constantly with SAP. One example is UI technology. A long-standing weakness of SAP's applications has clearly been user-interface and user-experience. The SAP GUI is outdated, and SAP's attempts to improve on it, like the Business Client, have, in my opinion, been torpedoed by SAP's monetization reflex. Initially they weren't enough of an improvement to see widespread adoption under a monetization scheme, and as adoption lagged, investment faded. A similar dynamic played out with SAP's support for Flash and Silverlight as UI technologies for years after it had become clear they were destined for the trash-heap of web-UI delivery technologies.

And yet, over the last 4 years, SAP has been able to overcome this tendency in one area that may be incredibly important to its long-term business prospects around the newly announced S/4HANA. In the case of a key succession of technologies, SAP has been able to do something different, with impressive results. Those technologies: Gateway, SAP UI5, and Fiori**.

Initially, all seemed to be going well. SAP developed Gateway in part due to prodding from influencers like SAP Mentors around the need for a modern, RESTful API layer in SAP applications to allow the development of custom user interfaces and add-ons in a sustainable manner. DJ Adams and others showed the way with projects like the Alternative Dispatch Layer (in 2009), which made development of these APIs easier. Uwe Fetzer taught the ABAP stack to speak JSON. And suddenly one could fairly easily create modern-looking APIs on SAP systems. SAP folded a lot of those learnings into Gateway, which was a great way to push the tools for building modern APIs into SAP's applications and out to the install base.

Well, it would have been, but SAP made its usual mistake: it attempted to monetize Gateway.

The result of the monetization attempt could be predicted well enough. It would make roll-out of Gateway to the SAP install base slow, at best. Fiori would be delayed because it would need to build out its own API layer rather than relying on Gateway technology. Applications like Simple Finance or S/4HANA that are dependent on Fiori would subsequently be delayed, if they were created at all. Perhaps Fiori's roll-out is delayed by a year, Simple Finance by 2 years, and S/4HANA isn't announced until 2018.

But S/4HANA was announced in January 2015. So what went differently this time?

Fortunately, back in 2011, shortly after the initial Gateway monetization strategy was announced, SAP Mentors like Graham Robinson and other community members stepped in to explain why this was a mistake and push back against the strategy, both publicly and privately. While clearly not the only reason for SAP's change of heart, this feedback from the community was powerful, and ultimately SAP revised Gateway licensing in 2012 so that it made sense for SAP, its partners, and customers to build new applications using Gateway.

This revision set us on the path to relatively quick uptake of SAP UI5 (a modern browser-based front-end framework which leverages the Gateway APIs) and later Fiori and the Fiori apps (most of which are based on UI5 and Gateway APIs). With Fiori, SAP again thought short-term and attempted to monetize it, only reversing course and including Fiori with existing Business Suite licenses after similar pressure from another group of SAP Mentors who had experienced customers' frustration with SAP's stagnant user-experience. These Mentors, community members, and analysts were able to communicate the role Fiori needed to play in guarding against SAP customers migrating to cloud vendors offering a more compelling user-experience story.

Fiori, meanwhile, is a hit and serves as the basis for Simple Finance and now S/4HANA - SAP's recently announced blockbuster refresh of the entire Business Suite offering, which SAP clearly hopes will drive its business for the next 10 years. Would that be happening now if SAP had remained on its standard course of attempting to monetize Gateway? I don't think so. The interaction with SAP on these topics often left some feathers a bit ruffled, but it sure must be nice for those like Graham and DJ to see some of the fruits of those discussions in major products like S/4HANA.

*A note on HANA: I think that HANA may be the exception in this story. Unlike most of SAP's other technology offerings, HANA is good enough to be competitive on its own, and not simply as a platform for SAP's applications. The results are not yet in on the HANA monetization strategy, but things are looking OK, if not great. Of course, Graham has something to say about that, and what he says is always worth reading.

** Fiori is actually a design language focused on user-experience, plus a line of mini-applications implementing that design language to improve SAP Business Suite user-experience for specific business processes. In the context of S/4HANA, however, we can think of Fiori as a prerequisite and enabling component of the offering.

Store everything: the lie behind data lakes

I hope the data lake idea has passed the peak of its current, unfortunately large, round of hype. Gartner came down on the concept pretty hard last year, only 4 years after the term was coined by Pentaho's James Dixon. More recently, Michael Stonebraker illustrated many of the same concepts from the perspective of a data management professional (note the excellent comment by @datachick). The frustration of conscientious data professionals with this concept is palpable.

The initial worry - wasted effort

The idea of a data lake is that we establish a single huge repository to store all data in the enterprise in its original format. The rationale is that the processes of data transformation, cleansing, and normalization result in loss of information that may be useful at a later date. To avoid this loss of information, we should store the original data rather than transforming it before storing it.

This is a relatively noble goal, but it overlooks the fact that, done correctly, these processes also introduce clarity and reduce false signals that are latent in the raw data. If every data analysis project needs to recreate the transformation and cleansing process before doing its analysis, the result will be duplicated work and duplicated errors. Long live the datawarehouse.

The deeper problem - adding information

In my reading, this sums up the usual justification for data lakes as well as the source of frustration usually expressed by data professionals. But I think there is another issue with data lakes that is often overlooked: the data lake approach implies that transformation is an information-negative operation, that transforming necessarily discards data, and therefore information, from the original data set. It is a response to a common frustration with datawarehousing: the data in the warehouse doesn't quite answer the question we are trying to ask, and if only we could get at the source data we could reconstruct the dataset so as to be able to answer the problem. Sometimes true.

Usually, however, there are real issues with this approach. The transformation from raw data to datawarehouse (or other intermediate representation) may remove information, but it also adds significant information to the source data. Specifically, it adds or acts on information about what data can go together and how it can be combined, and about how the data collection method may have influenced the data; it sometimes weights data according to external information about its quality, may add interpolated data, and so on. Usually this additional information is not stored along with the source data set. It's rare that it's stored in an accessible way along with the resulting data in a datawarehouse. I've seen almost no mention of maintaining this type of information in the context of a data lake. It is simply lost, or at best difficult to access.
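
As a contrived sketch of the point (hypothetical fields throughout), consider a cleaning step for raw sensor readings. It discards information (the original strings), but it also adds information that is not recoverable from the raw rows alone:

// Cleaning both removes and adds information.
function clean(row) {
  var fahrenheit = parseFloat(row.temp);     // discards the unit suffix on '98.6F'
  return {
    tempC: (fahrenheit - 32) * 5 / 9,        // adds: unit normalization
    weight: row.sensor === 'A7' ? 0.5 : 1.0, // adds: downweights a sensor known to be flaky
    valid: row.reading_ok === 'y'            // adds: interpretation of an undocumented code
  };
}

console.log(clean({ temp: '98.6F', sensor: 'A7', reading_ok: 'y' }));
// ≈ { tempC: 37, weight: 0.5, valid: true }

The knowledge that sensor A7 is flaky, or that 'y' marks a valid reading, typically lives only in code like this. A data lake that stores just the raw rows loses it.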

Existential questions - If a tree falls...

Which brings me to the big lie around data lakes: storing everything. If a measurement is never taken, is it data? How many measurements are not taken for each measurement that is? Can a data lake store these un-measured measures?

The issue in data analysis, I find, is not so much that data to answer our question was collected but discarded. Rather, the problem is that the data was never collected, the measurements never taken, or taken improperly for use in addressing the question now at hand. Like the traveller in Frost's poem, we have limited resources. Collecting one measure is a road that precludes collecting other measures due to these limits. Storing everything, or even most things, is not possible.

Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.
— Robert Frost, "The Road Not Taken"

Later, we often wonder if taking the other road would have allowed us to answer our current questions, but we can't go back and make different decisions about measurement. Data lakes don't change that. Maybe data lake vendors should switch to selling time machines.