Generative AI - Abstractions, trade-offs, and learning by doing

A disagreement, or at least a question, that comes up often in the generative AI space is whether people should use generative AI without a deep understanding of how generative AI systems work. Another way to put this is to ask whether we should rely on higher-level abstractions in our work.

I’m personally focused on the professions of developer and architect, so that’s what I’ll talk about today. We all rely on abstractions, but making the right decisions about the level of abstraction we use is what differentiates a great developer or architect from a mediocre one. This isn't a new discussion, and it comes up often with both new and older technologies. I often argue that developers and architects should be on the more conservative side here. Why is this?

I think it's important to take a couple of things into account when making these recommendations.

  1. What is the potential downside of implementing without full understanding?

  2. How much does implementation facilitate learning?

These two considerations weigh against each other when trying to decide whether it makes sense to implement without deep understanding. If implementation will help us learn about and avoid the trade-offs and downsides of a technology, that is an argument for learning by doing. If the potential downsides are very large, that is an argument against. Remember, these attributes weigh against each other, so even if the potential downside is small, if the learning upside is also small, it may still make sense to avoid an abstraction.

All of this must, of course, be weighed against other factors such as overall value (which includes downside). So this consideration is just one of many that needs to be taken into account.

So what of generative AI? Unfortunately here the potential downsides are fairly high and the ability to learn through abstraction is quite low. The prompt itself is an abstraction that seems to work against learning what is really going on in generative AI models such as LLMs, and which itself spawns troublesome secondary abstractions such as anthropomorphizing LLMs and certain types of magical thinking about what LLMs are really capable of.

I believe the reason for this lack of learning is that the primary failure modes of LLMs are usually invisible to non-experts. Hallucinations can only be spotted if the user already has good knowledge of the domain, is able to think critically about it, and is willing to make the effort to do their own research. Similarly, bias in all its forms, including racism and sexism, is invisible at least as often as it is visible, and both filter models and fine-tuning have been deployed on LLMs to make bias even more invisible than it would otherwise be.

For these reasons, I strongly urge developers and architects intending to deploy these models in products and infrastructure to first gain a good understanding of how these models really work. This is not as difficult as it may seem. Bea Stollnitz's blog posts on these topics are accessible, detailed, and link to the original papers, which are themselves usually written in a largely accessible manner for those with some technical background. Other resources and explainers have also become available if you prefer a different style or want to learn about a different model (Stollnitz mostly focuses on GPT LLMs).

It’s important to remember that this recommendation is for developers and architects looking to deploy these models in products. For developers and architects trying to use these models as day-to-day assistants to augment their own capabilities, the calculus is quite different! Developers are experts in their own domain, so they are in a much better position to spot and address hallucinations. Additionally, racist and sexist biases in these models are unlikely to express themselves in a damaging way when we are trying to program algorithms or get feedback on coding practices and style from an LLM-based tool like GitHub Copilot. Because these downsides are, respectively, mitigated and smaller, it is much more likely that a developer can responsibly and productively use tools like GitHub Copilot or ChatGPT as assistants in their professional work. The same is true for most domain experts.

Notes on web scraping

Well hello there. You may remember me from blogs written well over a year ago! Glad to be back.

Yesterday I tweeted about web scraping, mentioning that it's always a bad idea. What I mean by "web scraping" is the practice of using automated tools to harvest data from a website. Usually this is done in order to try to gather data for research or for use in a product.

That was, I'm sure, a bit too strong. After all, universals are always incorrect. But I think it's basically the right approach, and I'd like to explain a little bit about why that is.

The problems with scraping

Web scraping is problematic for several reasons, among them:

  • Legality, in that scraping is usually prohibited by terms of service, and the use of scraped content is usually a copyright violation barring fair-use exceptions. So in scraping you risk running afoul of wonderful laws like (in the US) the CFAA and DMCA. If you've heard of Aaron Swartz, you've heard of someone who got slammed with the CFAA for web scraping.
  • Ethics, in that regardless of the law it can be unethical to use content that doesn't belong to you, and in scraping you risk bypassing restrictions on content use that may have been put in place intentionally and that you may not even be aware of. Remember, the fact that information is technically accessible, or even public, doesn't mean that it is not sensitive. Example: your whereabouts on a public street are technically public information, at least in the US. Your neighbor waving at you when you leave your house is great. Someone sitting outside your house 24/7 and tweeting whenever you come and go is grounds for a restraining order.
  • Technical and financial concerns, such as denial of service. You don't know what kind of capacity the website you are scraping has available, nor how they are paying for it. It could well be that every request you send costs the website incremental money, or that their server will simply crash under the load of your scraping. You can't know this ahead of time if you haven't consulted with the person or organization running the website. You can be careful, and you can even respect technical indicators like robots.txt files, but you still won't really know unless you coordinate.

Instructors: Keep in mind that when you indicate to your students that it is OK to web scrape to gather data for their projects, you are opening them up to real liability. Encouraging your students to violate the CFAA is no small thing, no matter how bad a law it is.

What are the alternatives?

First, I'd like to remind us all to keep in mind that we are not generally entitled to use resources we don't own in the service of our own projects. Now, there are certainly situations in which using resources without the owner's permission can be justified, and we'll get to that down around step 6 or 7, but all other things being equal, if there is a better option, you should use it.

My recommendation when looking to acquire bulk data for research purposes is to do the following in this order:

  1. Look for bulk download options provided by the content owner. It is very common to see people talk about scraping sites like Wikipedia, IMDB, Github, or StackOverflow. Guess what! No need, as all of these sites provide bulk data download options.
  2. If no bulk download is available, see if there is an API that might provide what you need within the API terms of use.
  3. Check the terms of service or user agreement to see if by some chance it actually explicitly allows automated scraping in some form. If it does, then go ahead, but this is a rare case.
  4. Email the organization or person running the site and ask if it would be possible to arrange access to the data you are interested in.
    1. You may want to do this well ahead of the time you'll need the data so that there is time for negotiation, legal agreements, etc.
    2. Simultaneously, you may want to ask around among your network and advisors as to whether there is a version of the bulk data available already that you may have missed.
  5. If the organization is not willing to allow use of their data, see if there is an alternative data set that could serve the same purpose. Note that this may require changing your research plan. That's fine! Part of building a good research plan is being able to execute it ethically and legally. It's a good skill to develop.
  6. Consult a lawyer. And an ethicist if there is anything conceivably sensitive about the data you are accessing.
  7. If you've determined that the merits of your need outweigh the legal and ethical issues around ownership and permission, and you are willing to accept the potential consequences of being wrong, scrape.

But...

... some people won't agree with the ethical points above, and keep in mind that I'm not a lawyer, nor do I really know anything about the law in this area. I'm laying out a fairly conservative position. If you disagree and your eyes are open to the possible consequences of being wrong, ranging from a mild slap on the wrist to career suicide and possible jail time, then who am I to stop you?

But if you are an instructor teaching a class, a lead researcher advising your subordinates, or an employee or contractor doing a project for a company, there is more at stake than just yourself. You could well get your students in trouble or subject your company to a damaging lawsuit without realizing that what you are doing is problematic. So think twice and look for alternatives.

Reductio

Crossfilter has played a pretty big part in my work for the last couple of years. For those who don't know, Crossfilter is a JavaScript library that "supports extremely fast (<30ms) interaction with coordinated views, even with datasets containing a million or more records" in the web browser. It is designed for use in visualizations and applications that allow interaction with and filtering of multiple dimensions of a single data set.

I've found Crossfilter to be a very powerful tool for building applications that support data interactions that feel 'real-time'. The idea is that <100ms interactions are fast enough to allow us to identify patterns and correlations as the visualization changes in exact coordination with our filtering. I use Crossfilter in several places, but the biggest project is easily Palladio, which is a testbed platform for research activities and a powerful example of the possibilities of browser-based data exploration tools.

Faceted browsing in Palladio - driven by Crossfilter and Reductio

Crossfilter consistently trips up people trying to use it for the first time because, I believe, its programming model for aggregations is inconsistent with the model we usually expect. It's for this reason that I started Reductio, a library that helps build up aggregation functions that work correctly and efficiently with Crossfilter's model.

I'm not going to get into all of the details here, but defining aggregations in Crossfilter requires writing functions that incrementally update the aggregation as records are added or removed. This is at odds with the way we usually compute aggregations in SQL or more standard map-reduce models, where the aggregation is built up from scratch each time.*

// A function that can be passed to Array.reduce to sum the array
function(a, b) {
  return a + b;
}
// The equivalent operation in Crossfilter requires 3 functions.
// p is the running aggregate value and v is the record being
// added to or removed from the group.
function reduceAdd(p, v) {
  return p + v.total;
}

function reduceRemove(p, v) {
  return p - v.total;
}

function reduceInitial() {
  return 0;
}
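
For context, here is a minimal sketch of how these three functions get wired into a Crossfilter group. The records and field names are hypothetical, just to make the example self-contained:

// Hypothetical records with a numeric 'total' field.
var data = [
  { category: 'a', total: 2 },
  { category: 'b', total: 3 },
  { category: 'a', total: 5 }
];

var cf = crossfilter(data);
var categoryDim = cf.dimension(function(d) { return d.category; });

// A Crossfilter group accepts the add, remove, and initial functions directly.
var totalsByCategory = categoryDim.group()
    .reduce(reduceAdd, reduceRemove, reduceInitial);

// totalsByCategory.all() -> [{ key: 'a', value: 7 }, { key: 'b', value: 3 }]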

It's these aggregation functions that consistently trip people up. The summation case is simple enough because addition is reversible: given the running total and the value of a record, adding and removing that record are computationally equivalent. But what about operations that are not reversible? Take the computation of a maximum. When adding a new record to a maximum aggregation, we just need to check whether the value of the record is larger than the current largest value we've seen (the current maximum).

// A function that can be passed to Array.reduce to return the max
function (a, b) {
  return Math.max(a, b);
}

But if we remove a record with a value equal to the current maximum, we need to know the next-largest value. And if we remove the record with that value, we need to know the one after that, and so on. We have to keep track of all values seen in order to avoid rebuilding the entire aggregation from scratch when a record is removed. And yes, this is significantly faster than rebuilding from scratch!

// The equivalent operation in Crossfilter (optimized, with
// error guards). Adapted from Reductio
var bisect = crossfilter.bisect.by(function(d) { return d; }).left;

function reduceAdd(p, v) {
    // Insert the new value into the sorted array of values;
    // the max is then the last element.
    var i = bisect(p.values, v.number, 0, p.values.length);
    p.values.splice(i, 0, v.number);
    p.max = p.values[p.values.length - 1];
    return p;
}

function reduceRemove(p, v) {
    var i = bisect(p.values, v.number, 0, p.values.length);
    p.values.splice(i, 1);

    // Guard against an empty group.
    if(p.values.length === 0) {
        p.max = undefined;
        return p;
    }

    p.max = p.values[p.values.length - 1];
    return p;
}

function reduceInitial() {
  return { values: [], max: 0 };
}

It's a lot of work to come up with these aggregation functions, make them efficient, and test them thoroughly, and Crossfilter beginners (myself included) consistently struggle with this. Believe it or not, things get even nastier when dealing with things like pivot- or exception-aggregations, when building aggregations where a record counts in more than one aggregation group (like moving averages, or tags), or when trying to compose multiple aggregations (commonly required for visualizations as simple as a scatter plot). Reductio supports all of these aggregations and a few more, and I hope it will allow a quicker, smoother start when using Crossfilter.
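
To give a flavor of the difference, here is a simplified sketch of the maximum aggregation above expressed with Reductio instead of hand-written reduce functions. Treat the details as illustrative rather than exact:

// Illustrative only: a max aggregation built with Reductio.
// 'data' is assumed to contain records with 'category' and 'number' fields.
var cf = crossfilter(data);
var categoryDim = cf.dimension(function(d) { return d.category; });
var group = categoryDim.group();

var reducer = reductio()
    .max(function(d) { return d.number; });

// Attaches generated reduceAdd/reduceRemove/reduceInitial functions
// to the group, so you never write them by hand.
reducer(group);

// Each group value now exposes the current maximum, e.g.
// group.all()[0].value.max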


* This difference in model is what allows very fast interactions with data and aggregations. If you aren't building an application that requires real-time interactions and are thinking about using Crossfilter simply to do one-time calculations or aggregations on your data, think again, as Crossfilter is probably not the right tool for that job.

Skating to where the puck is

Being an SAP Mentor is an interesting situation. The relationship with SAP usually feels amazingly collaborative, but occasionally uncomfortably adversarial. This is probably a healthy place for a relationship with a massive organization to be. Every organization includes many great people, and interactions with those people are always delightful. Every organization is also, as an organization, predisposed to get the answers to certain questions consistently right and the answers to other sorts of questions consistently wrong.

One such question that SAP has consistently gotten wrong in the past is how to push technology adoption across a massive install base. Historically, it would appear that SAP has tended to try to monetize upgrades to underlying technology in order to recoup development costs or profit from its technology offerings. NetWeaver, BW, Visual Composer, add-ons like SSO and Gateway, Personas, Business Client, and HANA* are all technologies in which SAP has either underinvested due to a lack of direct revenue potential (BW especially) or which it has tried to monetize. The result is that widely deployed "free" technology from SAP often stagnates, while advanced technology with an additional cost often sees limited adoption in SAP's install base.

This outcome is terrible for SAP's business. It makes it very difficult for SAP to keep its application offerings competitive in the marketplace. The reason is that the basic application functionality needed to stay competitive often depends on improvements in underlying technology. But if technology is either widely deployed and under-featured, or decently featured but not widely deployed, then applications must target only the under-featured technology stack in order to have a large enough potential install base to justify development costs. In other words, SAP often forces itself to build applications on sub-par technology.

We see this dynamic constantly with SAP. One example is UI technology. Clearly a long-standing weakness of SAP's applications has been user interface and user experience. The SAP GUI is outdated, and SAP's attempts to improve on it, like the Business Client, have, in my opinion, been torpedoed by SAP's monetization reflex. They initially weren't enough of an improvement to see widespread adoption under a monetization scheme, and with lack of adoption, investment has faded. A similar dynamic played out with SAP's support for Flash and Silverlight as UI technologies for years after it had become clear they were destined for the trash-heap of web-UI delivery technologies.

And yet, over the last 4 years, SAP has been able to overcome this tendency in one area that may be incredibly important to its long-term business prospects around the newly announced S/4HANA. In the case of a key succession of technologies, SAP has been able to do something different, with impressive results. Those technologies: Gateway, SAP UI5, and Fiori**.

Initially, all seemed to be going well. SAP developed Gateway in part due to prodding from influencers like SAP Mentors around the need for a modern, RESTful API layer in SAP applications to allow the development of custom user interfaces and add-ons in a sustainable manner. DJ Adams and others showed the way with projects like the Alternative Dispatch Layer (in 2009), to make development of these APIs easier. Uwe Fetzer taught the ABAP stack to speak JSON. And suddenly one could fairly easily create modern-looking APIs on SAP systems. SAP folded a lot of those learnings into Gateway, which was a great way to push the tools for building modern APIs into SAP's applications and out to the install-base.

Well, it would have been, but SAP made its usual mistake: it attempted to monetize Gateway.

The result of the monetization attempt could have been predicted well enough. It would make roll-out of Gateway to the SAP install base slow, at best. Fiori would be delayed because it would need to build out its own API layer rather than relying on Gateway technology. Applications like Simple Finance or S/4HANA that are dependent on Fiori would subsequently be delayed, if they were created at all. Perhaps Fiori's roll-out is delayed by a year, Simple Finance by two years, and S/4HANA isn't announced until 2018.

But S/4HANA was announced in January 2015. So what went differently this time?

Fortunately, back in 2011, shortly after the initial Gateway monetization strategy was announced, SAP Mentors like Graham Robinson and other community members stepped in to explain why this was a mistake and push back against the strategy, both publicly and privately. While clearly not the only reason for SAP's change of heart, this feedback from the community was powerful, and ultimately SAP revised Gateway licensing in 2012 so that it made sense for SAP, its partners, and customers to build new applications using Gateway.

This revision set us on the path to relatively quick uptake of SAP UI5 (a modern browser-based front-end framework which leverages the Gateway APIs) and later Fiori and the Fiori apps (most of which are based on UI5 and Gateway APIs). With Fiori, SAP again thought short-term and attempted to monetize it, only reversing course and including Fiori with existing Business Suite licenses after similar pressure from another group of SAP Mentors who had experienced customers' frustration with SAP's stagnant user experience. These Mentors, community members, and analysts were able to communicate the role Fiori needed to play in guarding against SAP customers migrating to cloud vendors offering a more compelling user-experience story.

Fiori, meanwhile, is a hit and serves as the basis for Simple Finance and now S/4HANA - SAP's recently announced blockbuster refresh of the entire Business Suite offering, which SAP clearly hopes will drive its business for the next 10 years. Would that be happening now if SAP had remained on its standard course of attempting to monetize Gateway? I don't think so. The interaction with SAP on these topics often left some feathers a bit ruffled, but it sure must be nice for those like Graham and DJ to see some of the fruits of those discussions in major products like S/4HANA.


*A note on HANA: I think that HANA may be the exception in this story. Unlike most of SAP's other technology offerings, HANA is good enough to be competitive on its own, and not simply as a platform for SAP's applications. The results are not yet in on the HANA monetization strategy, but things are looking OK, if not great. Of course, Graham has something to say about that, and what he says is always worth reading.

** Fiori is actually a design language focused on user experience, plus a line of mini-applications implementing that design language to improve SAP Business Suite user experience for specific business processes. However, in the context of S/4HANA we can think of Fiori as an enabling component and technology prerequisite.