Notes on web scraping

Well hello there. You may remember me from blogs written well over a year ago! Glad to be back.

Yesterday I tweeted about web scraping, mentioning that it's always a bad idea. What I mean by "web scraping" is the practice of using automated tools to harvest data from a website. Usually this is done in order to try to gather data for research or for use in a product.

That was, I'm sure, a bit too strong. After all, universals are always incorrect. But I think it's basically the right approach, and I'd like to explain a little bit about why that is.

The problems with scraping

Web scraping is problematic for several reasons, among them:

  • Legality, in that scraping is usually prohibited by terms of service, and the use of scraped content is usually a copyright violation barring fair use exceptions. So in scraping you risk running afoul of wonderful laws like (in the US) the CFAA and DMCA. If you've heard of Aaron Swartz, you've heard of someone who got slammed with the CFAA for web scraping.
  • Ethics, in that regardless of law it can be unethical to use content that doesn't belong to you, and in scraping you risk bypassing restrictions on content use that may have been put in place intentionally and which you may not even be aware of. Remember, the fact that information is technically accessible, or even public, doesn't mean that it is not sensitive. Example: Your whereabouts in a public street is technically public information, at least in the US. Your neighbor waving at you when you leave your house is great. Someone sitting outside your house 24/7 and tweeting whenever you come and go is grounds for a restraining order.
  • Technical and financial concerns such as denial of service. You don't know what kind of capacity the website you are scraping has available, nor how they are paying for it. It could well be that every request you send costs the website incremental money, or that their server will simply crash under your scraping. You can't know this ahead of time if you haven't consulted with the person or organization running the website. You can be careful, and you can even respect technical indicators like robots.txt files (see the sketch after this list), but you still won't really know unless you coordinate.
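
Since I mentioned robots.txt: here is a minimal sketch of what respecting it might look like, assuming Node.js 18+ for the global fetch. It only handles the simple "User-agent: *" / "Disallow:" case, so treat it as illustrative rather than a full parser, and remember that the absence of a disallow rule is not the same as permission.

// A minimal, hedged sketch of checking a site's robots.txt before fetching
// a path. Handles only the simple "User-agent: *" / "Disallow:" case.
async function isPathAllowed(origin, path) {
  const res = await fetch(new URL('/robots.txt', origin));
  if (!res.ok) return true; // no robots.txt found: no stated restriction (not permission!)

  const lines = (await res.text()).split('\n').map(function(l) { return l.trim(); });
  let appliesToAll = false;
  const disallowed = [];

  for (const line of lines) {
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim();
    const value = line.slice(idx + 1).trim();
    if (/^user-agent$/i.test(field)) appliesToAll = (value === '*');
    else if (appliesToAll && /^disallow$/i.test(field) && value) disallowed.push(value);
  }

  return !disallowed.some(function(prefix) { return path.startsWith(prefix); });
}

// Usage (hypothetical site): isPathAllowed('https://example.com', '/private/data')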

Instructors: Keep in mind that when you indicate to your students that it is OK to web scrape to gather data for their projects, you are opening them up to real liability. Encouraging your students to violate the CFAA is no small thing, no matter how bad a law it is.

What are the alternatives?

First, I'd like to remind us all to keep in mind that we are not generally entitled to use resources we don't own in the service of our own projects. Now, there are certainly situations in which using resources without the owner's permission can be justified, and we'll get to that down around step 6 or 7, but all other things being equal, if there is a better option, you should use it.

My recommendation when looking to acquire bulk data for research purposes is to do the following in this order:

  1. Look for bulk download options provided by the content owner. It is very common to see people talk about scraping sites like Wikipedia, IMDB, Github, or StackOverflow. Guess what! No need, as all of these sites provide bulk data download options.
  2. If no bulk download is available, see if there is an API that might provide what you need within the API terms of use (see the sketch after this list for what that can look like).
  3. Check the terms of service or user agreement to see if by some chance it actually explicitly allows automated scraping in some form. If it does, then go ahead, but this is a rare case.
  4. Email the organization or person running the site and ask if it would be possible to arrange access to the data you are interested in.
    1. You may want to do this well ahead of the time you'll need the data so that there is time for negotiation, legal agreements, etc.
    2. Simultaneously, you may want to ask around among your network and advisors as to whether there is a version of the bulk data available already that you may have missed.
  5. If the organization is not willing to allow use of their data, see if there is an alternative data set that could serve the same purpose. Note that this may require changing your research plan. That's fine! Part of building a good research plan is being able to execute it ethically and legally. It's a good skill to develop.
  6. Consult a lawyer. And an ethicist if there is anything conceivably sensitive about the data you are accessing.
  7. If you've determined that the merits of your need outweigh the legal and ethical issues around ownership and permission, and you are willing to accept the potential consequences of being wrong, scrape.
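
As an example of option 2 above, here is a hedged sketch of pulling repository metadata from GitHub's documented REST API instead of scraping GitHub pages. It assumes Node.js 18+ for the global fetch, and the fields pulled out reflect my reading of the API docs; check the API's terms of use and rate limits before relying on it.

// A sketch of using a documented API (GitHub's REST API) rather than scraping.
// The owner/repo values and the fields extracted are just for illustration.
async function repoInfo(owner, repo) {
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}`, {
    headers: { Accept: 'application/vnd.github+json' }
  });
  if (!res.ok) throw new Error(`GitHub API request failed: ${res.status}`);

  const data = await res.json();
  return {
    stars: data.stargazers_count,
    forks: data.forks_count,
    license: data.license && data.license.spdx_id
  };
}

// repoInfo('crossfilter', 'crossfilter').then(console.log);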

But...

... some people won't agree with the ethical points above, and keep in mind that I'm not a lawyer, nor do I really know anything about the law in this area. I'm laying out a fairly conservative position. If you disagree and your eyes are open to the possible consequences of being wrong, ranging from a mild slap on the wrist to career suicide and possible jail time, then who am I to stop you?

But if you are an instructor teaching a class, a lead researcher advising your subordinates, or an employee or contractor doing a project for a company, there is more at stake than just yourself. You could well get your students in trouble or subject your company to a damaging lawsuit without realizing that what you are doing is problematic. So think twice and look for alternatives.

Reductio

Crossfilter has played a pretty big part in my work for the last couple of years. For those who don't know, Crossfilter is a Javascript library that "supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records" in the web browser. It is designed for use in visualizations and applications that allow interaction with and filtering of multiple dimensions of a single data set.

I've found Crossfilter to be a very powerful tool for building applications that support data interactions that feel 'real-time'. The idea is that <100ms interactions are fast enough to allow us to identify patterns and correlations as the visualization changes in exact coordination with our filtering. I use Crossfilter in several places, but the biggest project is easily Palladio, which is a testbed platform for research activities and a powerful example of the possibilities of browser-based data exploration tools.

Faceted browsing in Palladio - driven by Crossfilter and Reductio

Crossfilter consistently trips up people trying to use it for the first time because, I believe, its programming model for aggregations is inconsistent with the model we usually expect. It's for this reason that I started Reductio, a library that helps build up aggregation functions that work correctly and efficiently with Crossfilter's model.

I'm not going to get into all of the details here, but defining aggregations in Crossfilter requires defining functions that incrementally update the aggregation based on the addition or removal of records. This is at odds with the standard way of computing aggregations by building them up from scratch that we see in SQL or more standard map-reduce models.*

// A function that can be passed to Array.reduce to sum the array,
// e.g. [1, 2, 3].reduce(sum, 0) === 6
function sum(a, b) {
  return a + b;
}
// The equivalent operation in Crossfilter requires 3 functions, registered
// with group.reduce(reduceAdd, reduceRemove, reduceInitial). Here p is the
// running total and v is the record being added or removed.
function reduceAdd(p, v) {
  return p + v.total;
}

function reduceRemove(p, v) {
  return p - v.total;
}

function reduceInitial() {
  return 0;
}

It's these aggregation functions that consistently trip people up. The summation case is simple enough because, given the running total and the value in question, adding and removing are computationally equivalent operations. But what about operations that are not reversible? Take the computation of a maximum. When adding a new record to a maximum aggregation, we just need to check whether the value of the record is larger than the current largest value we've seen (the current maximum).

// A function that can be passed to Array.reduce to return the max,
// e.g. [1, 5, 3].reduce(max) === 5
function max(a, b) {
  return Math.max(a, b);
}

But if we remove a record with a value equal to the current maximum, we need to know the next-largest remaining value. And if we remove the record with that value, we need to know the one after that, and so on. We have to keep track of all values seen in order to avoid rebuilding the entire aggregation from scratch whenever a record is removed. And yes, this is significantly faster than rebuilding from scratch!

// The equivalent operation in Crossfilter (optimized, with
// error guards). Adapted from Reductio.
// p.values is kept sorted so the maximum is always the last element.
var bisect = crossfilter.bisect.by(function(d) { return d; }).left;

function reduceAdd(p, v) {
  // Insert the new value into the sorted array and update the max.
  var i = bisect(p.values, v.number, 0, p.values.length);
  p.values.splice(i, 0, v.number);
  p.max = p.values[p.values.length - 1];
  return p;
}

function reduceRemove(p, v) {
  // Remove one occurrence of the value from the sorted array.
  var i = bisect(p.values, v.number, 0, p.values.length);
  p.values.splice(i, 1);

  // If no values remain, there is no maximum.
  if(p.values.length === 0) {
    p.max = undefined;
    return p;
  }

  p.max = p.values[p.values.length - 1];
  return p;
}

function reduceInitial() {
  return { values: [], max: undefined };
}

It's a lot of work to come up with these aggregation functions, make them efficient, and test them thoroughly, and Crossfilter beginners (myself included) consistently struggle with this. Believe it or not, things get even nastier when dealing with things like pivot- or exception-aggregations, when building aggregations where a record counts in more than one aggregation group (like moving averages, or tags), or when trying to compose multiple aggregations (commonly required for visualizations as simple as a scatter-plot). Reductio supports all of these aggregations and a few more, and I hope it will allow a quicker, smoother start when using Crossfilter.
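
For comparison, here is a rough sketch of what delegating the max aggregation above to Reductio looks like. The reductio().max(accessor) call and applying the resulting reducer to a Crossfilter group reflect my reading of the Reductio documentation; the records and field names are invented for illustration.

// Assumes the crossfilter and reductio libraries are loaded.
// Invented sample records; the 'number' field matches the accessor below.
var records = [
  { category: 'a', number: 3 },
  { category: 'a', number: 9 },
  { category: 'b', number: 5 }
];

var cf = crossfilter(records);
var dim = cf.dimension(function(d) { return d.category; });
var group = dim.group();

// Reductio generates and attaches the reduceAdd/reduceRemove/reduceInitial
// functions (including the sorted-values bookkeeping shown above) to the group.
reductio().max(function(d) { return d.number; })(group);

group.all(); // each group's value.max now holds the current maximum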


* This difference in model is what allows very fast interactions with data and aggregations. If you aren't building an application that requires real-time interactions and are thinking about using Crossfilter simply to do one-time calculations or aggregations on your data, think again: Crossfilter is probably not the right tool for that job.

Store everything: the lie behind data lakes

I hope the data lake idea has passed the peak of its current, unfortunately large, round of hype. Gartner came down on the concept pretty hard last year, only 4 years after the term was coined by Pentaho's James Dixon. More recently, Michael Stonebraker illustrated many of the same concepts from the perspective of a data management professional (note the excellent comment by @datachick). The frustration of conscientious data professionals with this concept is palpable.

The initial worry - wasted effort

The idea of a data lake is that we establish a single huge repository to store all data in the enterprise in its original format. The premise is that the processes of data transformation, cleansing, and normalization result in loss of information that may be useful at a later date. To avoid this loss of information, we should store the original data rather than transforming it before storing it.

This is a relatively noble goal, but it overlooks the fact that, done correctly, these processes also introduce clarity and reduce false signals latent in the raw data. If every data analysis project needs to recreate the transformation and cleansing process before doing its analysis, the result is duplicated work and errors. Long live the datawarehouse.

The deeper problem - adding information

In my reading, this sums up the usual justification for data lakes as well as the source of frustration usually expressed by data professionals. But I think there is another issue with data lakes that is often overlooked: the data lake approach implies that transformation is an information-negative operation - that transforming necessarily discards data, and therefore information, from the original data set. It is a response to a common frustration with datawarehousing - that the data in the warehouse doesn't quite answer the question we are trying to ask, and if only we could get at the source data we could reconstruct the dataset so as to be able to answer it. Sometimes true.

Usually, however, there are real issues with this approach. The transformation from raw data to datawarehouse (or other intermediate representation) may remove some information, but it also adds significant information to the source data. Specifically, it adds or acts on information about which data can go together and how it can be combined, and about how the data collection method may have influenced the data; it sometimes weights data according to external information about its quality, adds interpolated data, and so on. Usually this additional information is not stored along with the source data set. It's rare that it's stored in an accessible way along with the resulting data in a datawarehouse. I've seen almost no mention of maintaining this type of information in the context of a data lake. It is simply lost, or at best difficult to access.

Existential questions - If a tree falls...

Which brings me to the big lie around data lakes: storing everything. If a measurement is never taken, is it data? How many measurements are not taken for each measurement that is? Can a data lake store these un-measured measures?

The issue in data analysis, I find, is not so much that data to answer our question was collected but discarded. Rather, the problem is that the data was never collected, the measurements never taken, or taken improperly for use in addressing the question now at hand. Like the traveller in Frost's poem, we have limited resources. Collecting one measure is a road that precludes collecting other measures due to these limits. Storing everything, or even most things, is not possible.

Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.
— Robert Frost, "The Road Not Taken"

Later, we often wonder if taking the other road would have allowed us to answer our current questions, but we can't go back and make different decisions about measurement. Data lakes don't change that. Maybe data lake vendors should switch to selling time machines.

Palladio

I've been working on this for a while, so it's probably worth posting something about it.

Palladio is a platform for data visualization and exploration designed for use by humanities researchers. We're in early beta at the moment and will be doing a series of releases throughout 2014. You can read about it and try it out here: http://palladio.designhumanities.org/

I think the website does a pretty good job of explaining the capabilities of the platform, so I'll leave that for the moment. I encourage you to go check it out before reading on, because it will be worth understanding how the platform works to give some context to the discussion below.

The map view in the Palladio interface

So, why am I super-excited about this? Mostly because of the great team, which has the vision, technical skills, theoretical and domain knowledge, and information design chops to pull off this type of project. I consider myself lucky to be able to work with this group.

It's also great to be working on a project like this for a field that is simultaneously very strong on information theory and a bit underserved in terms of some types of tools. This is in stark contrast to my usual enterprise data management and visualization work where the theory tends to be weak but a plethora of tools exist.

In addition to trying to build a tool that incorporates important and underserved aspects of humanistic inquiry, I am excited to work with a team that buys into introducing state-of-the-art concepts around data exploration tools in general. Many of the concepts we are working to implement in Palladio are directly applicable to the types of data exploration problems we find in the enterprise and are concepts rarely expressed in existing tools. Palladio is a great example (one of many great examples) of how the process of humanistic inquiry can motivate the development of methods that are both technically and conceptually applicable in wildly different disciplines.

Interaction

The thing that initially most impresses people about Palladio is the way that filtering and movement are integral to the visualization. Specifically, the visualizations update and move in real-time as you filter. This is not a new concept, but I don't think I've ever seen it fully implemented in a general-purpose tool. Getting the level of movement right is a design challenge that the team is tackling as a work in progress, but in my opinion this characteristic of real-time updates and movement is a key feature for a data exploration tool, and few if any tools implement it.

I'll try not to get too squishy here, but this behavior of the tool allows a person to interact with the data in a very direct way, giving a feel for the data that would not otherwise exist. When you can see the results of your interactions with the data in real time, it is a lot easier to conceptually link step-changes and interesting events with the actions that caused them. For example, dragging a filter along the timeline component allows you to play back history at your own speed, speeding up or slowing down as suits you. My theory-foo is weak, but when you see it, I think you'll understand. Try it out with the sample data.

Browser

Techy alert: Palladio is a purely client-side, browser-based application. The only involvement of a server is to deliver the HTML, Javascript, and CSS files that comprise Palladio. We arrived at this design through a few iterations, but the motivation was that we wanted to be cross-platform, super-easy to install, and still support pretty big data sets and fluid interactions. 10 years ago, this would have been nearly impossible, but today we have web browsers that, amazingly, are capable of supporting these types of applications. Yes, browsers can now dynamically filter, aggregate, and display tens and hundreds of thousands of rows of data, rendering hundreds of data points simultaneously in SVG - thousands if you use canvas instead.
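
To make that concrete, here is a minimal sketch of the kind of client-side filtering this enables, using Crossfilter's public API (dimension, filterRange, group). The records and fields are invented for illustration; in Palladio the filter call would be driven by a timeline or facet component rather than written by hand.

// Assumes the crossfilter library is loaded.
// Invented sample data standing in for a loaded data set.
var records = [
  { year: 1492, place: 'Genoa', letters: 4 },
  { year: 1503, place: 'Florence', letters: 7 },
  { year: 1510, place: 'Rome', letters: 2 }
];

var cf = crossfilter(records);
var byYear = cf.dimension(function(d) { return d.year; });
var byPlace = cf.dimension(function(d) { return d.place; });
var lettersByPlace = byPlace.group().reduceSum(function(d) { return d.letters; });

// Dragging a timeline filter boils down to calls like this one. Every other
// dimension's groups update immediately, in the browser, with no server round trip.
byYear.filterRange([1500, 1512]);
lettersByPlace.all(); // totals now reflect only the records in the filtered year range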

The time for client-side data visualization in the browser has come and we are taking advantage of that in a big way. A great strength of browser-rendered visualizations is that they allow true interaction with the visualization. Just using SVG or Canvas as a nicer replacement for static images is fine, but it isn't fully exploiting the medium. Add to this that the type of interactivity we are providing with Palladio is technically impossible in a client-server setup. Even if the server responds to queries instantaneously, the round-trip time the client-server communication introduces means that interactions won't be as closely linked as they are in Palladio, severely degrading the quality of the interactive experience.

Admittedly, we have work to do on performance, and our cross-browser support could be better. Additionally, the problem of data that simply doesn't fit in the browser's memory remains unaddressed, though we have some ideas for mitigating it. But I think this is an application design approach that could be exploited for the vast majority of data sets out there, either because the data itself is relatively small, or through judicious use of pre-aggregation to avoid performance and size issues.

Design

Lastly, user experience and information design have been integral components of this project from the start. The design has been overhauled several times along the way, and I wouldn't be at all surprised if it happened again. To be clear, I'm a complete design newb, but we have a great designer working on the team. One thing that has become clear to me through this process is that designing a general purpose interactive visualization tool is hard. There are more corner-cases than I previously imagined possible, but we are trying to get the big pieces right and I think we're on the road to success.

Obviously the organizational dynamics on a small team like ours are very different than those in a big development organization, but it seems like information design on most of the enterprise data exploration tools from larger vendors either started out problematic and stayed that way, or started out pretty well and started slipping as the tool took off. I'm not sure if there is an answer to this, but it's clear that when building a tool in this space, having at least one information designer with a strong voice on the team is indispensable.

Let me sum up

So, that's the Palladio project, as well as a few takeaways that I feel can be applied back to pretty much any data exploration project. In closing, I'll just mention that none of this would be possible without some great open source projects that we use and, most importantly, without the great team working on this as well as the feedback and patience of our dedicated beta participants. The central Javascript libraries we used to pull this off are d3.js, Crossfilter, and Angular.js. The members of the core team are Nicole Coleman (our fearless leader), Giorgio Caviglia (design and visualization), Mark Braude (documentation, testing, working with our beta users, and project management), and myself doing technical implementation work. Dan Edelstein is the principal investigator.

It's been a great ride so far and we've got some exciting things planned for the rest of the year. This is definitely a work in progress, and feedback is very welcome. Follow the Humanities + Design lab on Twitter for updates.