Software Systems Design (FIRST DRAFT)

Version 0.1

6. The Web

Main topic:

In the 1990s, hypertext finally got traction and everyone wanted to be online, giving everyone else a reason to be online. The Internet, previously a tiny hamlet, was now a rapidly growing megalopolis. To many of us who were online in the decade before HTML, HTTP, and Mosaic, it felt as though, overnight, all of our networks grew useful new connections to each other.

Before the web, academics and other researchers had Usenet News for online discussions and the sharing of code and, occasionally, data. Posts were plain text. Source code was sometimes included right in a post. More often, code was hosted on an FTP server, and the FTP host and path were posted.

The experience of being online was that of membership in a small community of people open to sharing their code and their experience. The tools for communicating were limited, until the web. In almost every respect, the technology of the web was not innovative. We already had markup languages and application protocols – both client and server code for many was readily available.

But the web had the right design at the right time.

The first HTTP protocol made one request at a time. Networks were fast enough (though not fast) to allow a client (soon to be called a browser) to initiate a TCP session every time it made a request. HTML was a very stripped-down use of SGML (from the mid 1980s), and simple enough to make it easy to build a browser. The core idea was, initially, to deliver static content without regard for security issues like privacy or integrity.
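To make that concrete, here is a minimal Python sketch (with example.com standing in for any host) of what an early-style exchange looks like: open a TCP connection, send one request, read the response, and the server closes the connection.

    # Sketch of an HTTP/1.0-style exchange: one TCP connection per request.
    # "example.com" is just a placeholder host.
    import socket

    with socket.create_connection(("example.com", 80)) as conn:
        # A complete request: request line, headers, blank line.
        conn.sendall(b"GET /index.html HTTP/1.0\r\nHost: example.com\r\n\r\n")

        # Read until the server closes the connection, which it does after
        # serving this single request.
        response = b""
        while chunk := conn.recv(4096):
            response += chunk

    print(response.split(b"\r\n")[0].decode())  # e.g. the status line

Every page (and, later, every image on a page) required its own connection like this; HTTP/1.1 later added persistent connections to amortize that cost.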

The NCSA Mosaic browser spread widely, starting in 1993. Non-commercial use was free of charge, and source code was available.

The network effect was readily apparent: The more people and institutions that had a presence on the web, the more wanted to be there. This positive feedback loop quickly brought commerce online. HTTP was revised for efficiency; HTTPS enabled some security; and HTML made a quantum leap with the addition of JavaScript.

Web servers grew more complex, with a variety of server-side scripting capabilities and full-fledged applications backed by databases. Browsers grew more complex, largely because they now needed to execute code in addition to rendering marked-up content, but also because users demanded that any browser be able to deal with myriad non-standard “vendor extensions” to and “vendor interpretations” of HTML. (Reportedly, a large amount of code in current browsers, as of 2024, exists for this purpose. Only recently has support for the quirks and outright deviance of Microsoft Internet Explorer been deprecated in modern browsers.)

The web is now firmly entrenched across the globe, serving both as an application platform for users (via browsers) and applications (via APIs provided over HTTP).

One lesson to draw here is that simplicity can drive adoption. HTML was simple compared to many alternatives, making it easier to write documents and to write a browser. HTTP was very simple, making both server and client easier to implement. (Network bandwidths were low, so page download and rendering took a lot of time. The extra overhead of establishing a TCP connection per request was not so problematic in the early days.)

As the web grew, its specifications grew in complexity. But making a document editor for a version of HTML with 30 tags does not look daunting when you already have one that supports 15 tags. Making a web server that handles several requests per connection is not daunting when you already have one that handles one request per connection.

Still, as all the pieces of the web grew in complexity, and the web itself became available in more and more places on earth, the challenges of building and maintaining web-based systems grew commensurately.

How do we design products and services that scale to millions of users living all across one large country like the U.S. or across many countries like the E.U.?

Investigation questions:

  1. How does the DNS system work? How is it organized?
  2. What is ICANN and what do they do?
  3. What is the Address Resolution Protocol (ARP)?
  4. What is the Border Gateway Protocol (BGP)? What is an Autonomous System (AS)?
  5. Why and how has Ukrainian internet traffic been routed through Russia (2022–2024)?
  6. Who was Jon Postel, and what is Postel’s Law?

Discussion topics:

  • Do you agree with Postel’s Law? Why or why not?
  • Pros and cons of a plain-text protocol versus a binary one
  • HTTP is a stateless protocol. What does that mean?
  • What was internet latency and bandwidth like in 1993?
  • What is a session? How are sessions implemented, given that HTTP is stateless?
  • Designing in layers, e.g. the role of TLS in HTTPS
  • Is HTTPS sufficient “security” for online purchases? Online banking? Online access to government services? Consider the various entities in these scenarios, and what assurances they might want.

Assessing complexity

There is no single measure of complexity that applies across all kinds of software-based systems. Complexity here is not the computational complexity we study in algorithms courses. And it’s not (necessarily) the same phenomenon studied in chaos theory or ecological systems – though there can be striking similarities.

A precise formula for software system complexity may not be obtainable, or may not be illuminating. Yet, we know it when we see it.

Recall that a system is, by definition, a collection of interacting components. If there are a lot of components, and a variety of interactions, we might call a system complex. This prompts a question: How do we measure these?

For instance, how large is an application? We can measure it in lines of code, but this metric has limited use. It does correlate well with the number of bugs, and it certainly affects how well a person can understand the codebase. These are valid and important developer-centric concerns, but they do not tell the entire story.

The back end of an application might be measured in terms of the number of servers it takes to run it. A single-server application, while being a single point of failure, is also easy to manage. We can regularly check the CPU load, whether the disk is getting full, whether its network interface is saturated with traffic, and other indicators of server health. We can also look directly at the application, maybe to see if some queues are getting large or if response times are increasing.

A large application probably has many servers, possibly thousands or more. And they are not all performing the same task. Some are load balancers, some run business logic, some host databases. Another measure of an application might be a count of how many different kinds of “nodes” are in a graph or diagram of the entire deployment.

Many types of nodes probably means we need a variety of skills on our team, because understanding firewalls and load balancers is a different domain of expertise than, say, databases. Having many types of nodes also makes an application complex.

Complexity can be seen when, for example:

  • There are many components, or a wide variety in the types of components.
  • Interactions between components are not restricted to just a few flows, but are many and varied.
  • Configuration conflicts are possible and not easily detected. For example, suppose load balancer L has server A in its list of places to which it routes requests. It’s possible that server A has a local firewall that blocks incoming connections from L.
  • Internal workings of the system are not observable. Does our system provide ways of peering inside, to see queue sizes or database activity or processing times? If not, then how will we debug a problem when one occurs? (One will eventually occur!)

Another aspect of complexity is not technical, but organizational. If all of the databases for all products are administered by a single team, the entire company gets the benefit of that team’s expertise. That doesn’t mean the developers on my product can get by knowing nothing about databases. We will have to work with the database team, probably often. They will have to understand our product’s requirements (from a database perspective) and we will have to understand the particular database(s) that our company uses.

A form of application complexity arises when a product team must work with several teams like the hypothetical database team I describe here. Good communication and timely planning will be necessary to ship a quality application on time. It’s simpler to coordinate activities within one team or department, and more challenging to coordinate across teams and departments. This kind of complexity is usually invisible when looking at system architecture diagrams, but it can impact deployment and debugging, both of which can be made easier or harder by how well teams communicate and accommodate each other.

Modeling and controlling complex systems

Queuing theory and control theory are two of many mathematical models that can be used to analyze and build complex software systems.

First, let’s look back at the early days of web servers because the “standard” web server design has changed more than once since then.

Matt Welsh’s Staged Event-Driven Architecture (SEDA) work was aimed at solving the following problem: when a web server (in the 1990s) was overloaded with requests, it would crash or hang. This is unsurprising if you know how Unix daemons work: a main thread listens for incoming requests (e.g., HTTP requests on port 80) and spawns a thread to handle each one.
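As a rough sketch (not the code of any particular server), that thread-per-request pattern looks something like this in Python:

    # Sketch of the classic thread-per-request daemon pattern.
    # Under heavy load the number of threads is unbounded: one per connection.
    import socket
    import threading

    def handle(conn: socket.socket) -> None:
        with conn:
            conn.recv(4096)  # read the request (toy handler ignores its contents)
            conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nhello\n")

    server = socket.create_server(("", 8080))  # listen on port 8080
    while True:
        conn, _addr = server.accept()
        # Spawn a new thread for every incoming connection.
        threading.Thread(target=handle, args=(conn,), daemon=True).start()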

In the 1990s, Linux threads were slow. (Windows was worse off, and not considered a viable OS for a server. MacOS before OS X used cooperative multitasking and was only a desktop OS.) When a Linux machine spawned too many threads, it got bogged down. If memory was the limiting factor, the kernel would start killing processes that requested more memory. Otherwise, there were just so many threads running – on a single core! – that each one made very little progress during its time on the CPU.

As the number of concurrently running threads (one per request) goes up, the time needed to service a request rises sharply. This is latency. And throughput, the number of requests served per unit time, plummets.

I don’t want to oversimplify SEDA, but I will highlight a critical design decision: Establish a pool of threads to handle requests, which are queued.

A thread pool is a fixed number of threads. The number can be adjusted based on the performance of the hardware or the complexity of the average query, or on any number of other metrics. The size of the pool is a tunable parameter.

The queue is a buffer. When requests arrive frequently, the queue holds them until they can be serviced. When the arrival rate slows, or there is a lull, the number of requests in the queue shrinks as requests are dequeued and serviced by threads in the pool.
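A minimal sketch of that design, reusing the toy handler from above, might look like this: a bounded queue of accepted connections, drained by a fixed pool of worker threads. The pool and queue sizes here are placeholders, not recommendations.

    # Sketch of a fixed thread pool fed by a queue (the essence of a SEDA-like stage).
    import queue
    import socket
    import threading

    POOL_SIZE = 8       # tunable: how many requests are serviced concurrently
    QUEUE_SIZE = 100    # tunable: how many requests we are willing to buffer

    requests = queue.Queue(maxsize=QUEUE_SIZE)

    def worker() -> None:
        while True:
            conn = requests.get()   # block until a request has been queued
            with conn:
                conn.recv(4096)
                conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nhello\n")

    for _ in range(POOL_SIZE):      # the pool: a fixed number of threads
        threading.Thread(target=worker, daemon=True).start()

    server = socket.create_server(("", 8080))
    while True:
        conn, _addr = server.accept()
        requests.put(conn)          # buffer the request until a worker is free

Note that requests.put() blocks when the queue is full, which quietly introduces a policy decision we return to below.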

Queuing theory

How do we know the queue won’t grow out of control? Put another way, given the size of the thread pool that our hardware can handle, how big do we expect our queue to get?

The response time for a request, from the perspective of the requester, is the sum of two durations: how long was the request in the queue, and how long did it take a thread to service the request.

If a queue grows large, containing many requests, the average response time lengthens. We’d like to know these quantities. How large the queue will get determines how much storage we need for it, and likely how it is implemented. How long the response time will get determines whether the system meets its requirements.

Fortunately, there is a varied and deep theory of queuing. I’m not an expert in this subject, but I have worked with people who are. They consider a great many factors when predicting the capacity of a system, and they take many measurements (from prototypes and early versions) to inform their analyses.

A very basic introduction to queuing theory, like this one, is worth a read. Little’s Law relates the average number of requests in a system (queued plus being served) to the arrival rate of new requests and the time a request spends in the system.
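As a back-of-the-envelope illustration (the numbers below are made up, not measurements), Little’s Law says L = λ × W, where λ is the average arrival rate and W is the average time a request spends in the system:

    # Little's Law: L = lam * W
    #   lam = average arrival rate of requests
    #   W   = average time a request spends in the system (queue wait + service)
    #   L   = average number of requests in the system (queued + being served)

    lam = 200.0   # requests per second (illustrative)
    W = 0.25      # seconds per request, on average (illustrative)

    L = lam * W
    print(L)      # -> 50.0 requests in the system, on average

If our pool can hold only 8 requests in service at once, then on average roughly 42 of those 50 are sitting in the queue – a first estimate of how much buffering we need and of how much of the response time is queuing delay.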

What’s interesting is that this relation holds independent of the arrival distribution, and the relation can be extended to model different kinds of queues. In our web server example, we have a fixed pool of threads, but if we modeled customers in a grocery store, the number of cashiers may vary during the day. Similarly, our web requests (probably) cannot be “canceled”, so they will remain queued until served, whereas a grocery store customer may leave the line at any time. And in a queue at the Motor Vehicle Department, a customer being served may be told to fill out a form and then get back in line. They are being re-queued!

Queuing theory addresses all manner of systems in which queues are a feature.

Ultimately, though, we have to make decisions as designers about what our systems will do when the queue gets too large, or latency gets too high. That is, what should happen if the arrival rate exceeds the capacity we have planned for?

Network routers simply drop packets when they are overloaded. Many web servers just drop requests, forcing us to notice that nothing is happening in our browser and click a link or button again.

But how can we reliably detect that a system is overloaded? Production systems are monitored by other systems! A monitoring system takes periodic measurements, such as queue size or CPU load or disk I/O performance. When a metric exceeds a user-defined threshold, it sends an alert.
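A monitoring check can be as simple as the sketch below; the metric source, threshold, and alerting mechanism are all placeholders for whatever monitoring system is actually in use.

    # Sketch of a threshold-based monitoring check (all names are placeholders).
    import time

    QUEUE_DEPTH_THRESHOLD = 80    # user-defined threshold, e.g. 80% of QUEUE_SIZE

    def current_queue_depth() -> int:
        # In a real deployment this would query the server's metrics endpoint.
        return 0

    def send_alert(message: str) -> None:
        print("ALERT:", message)  # stand-in for paging, email, a dashboard, etc.

    while True:
        depth = current_queue_depth()
        if depth > QUEUE_DEPTH_THRESHOLD:
            send_alert(f"queue depth {depth} exceeds {QUEUE_DEPTH_THRESHOLD}")
        time.sleep(10)            # periodic measurement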

In some cases, human operators deal with the alerts. For example, when a system is close to running out of disk space, an operator can provision more storage.

But ideally, we’d configure some action to happen automatically in response to an overloaded situation. This article on “back pressure” is an accessible treatment of that issue. (An archived version is on the Wayback Machine.) When possible, we should throttle the production of requests to alleviate load on our system. If we cannot do this, we can buffer requests – this is our queue. Finally, if we cannot queue any more requests, we can simply drop them.
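In the thread-pool sketch above, those three options are small code changes in the accept loop. Here is a hedged illustration of “drop when full”; blocking on put() instead would be a crude form of back pressure on the producer.

    # Sketch: what the accept loop can do when the request queue is full.
    import queue

    requests = queue.Queue(maxsize=100)

    def submit(conn) -> None:
        try:
            requests.put_nowait(conn)   # buffer the request if there is room
        except queue.Full:
            # No room left: shed load by dropping the request (politely, with
            # a 503) rather than letting the backlog grow without bound.
            conn.sendall(b"HTTP/1.0 503 Service Unavailable\r\n\r\n")
            conn.close()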

For those interested in how the primary designer of SEDA views it now (well, in 2010), read Matt’s retrospective on what he’d do differently. Still, key design decisions have stood the test of time, as most web servers (and many other systems) are designed as a pipeline in which requests are processed in stages, and there are queues connecting the stages. To quote Matt:

The most important contribution of SEDA, I think, was the fact that we made load and resource bottlenecks explicit in the application programming model. Regardless of how one feels about threads vs. events vs. stages, I think this is an extremely important design principle for robust, well-behaved systems. SEDA accomplishes this through the event queues between stages, which allow the application to inspect, reorder, drop, or refactor requests as they flow through the service logic.

Control theory

Cloud offerings are popular because someone else runs the infrastructure needed to host an application. Someone else worries about reliable electricity, adequate cooling, redundant hardware, and other concerns.

When scaling an application up to serve a large number of customers, deploying on the cloud can bring another benefit: auto-scaling. The number of servers – or of other resources, like network or disk capacity – allocated to an application can be dynamically increased to meet demand.

Without any kind of scaling, we would have to set up our application with all the resources needed to meet peak demand, and pay for that every day, no matter how low demand might be or for how long. We could scale manually, however. This works well when demand is predictable. If the marketing department is making a huge ad buy and a flood of new customers are expected to “kick the tires” or to start using the free version of a product, we can add servers to ensure these new arrivals have a good experience.

Auto-scaling simply automates the scaling. When demand is high, more servers are provisioned (and used, and billed for). But when demand is low, servers are removed from the application, which lowers the bills we get from the cloud company. But is automated scaling actually simple?

The field of Control Theory has many useful lessons for us about how to automate processes despite uncertainties. There are many uncertainties:

  • Can we combine auto-scaling, which is based on continuously measuring demand, with our knowledge of expected periods of high (or low) demand?
  • Perhaps demand is rising slowly. Will it keep rising slowly or rise more quickly or change direction and go down?
  • It normally takes 3 to 5 minutes for our cloud provider to provision a new virtual machine, but occasionally it takes much longer. Is now one of those times?
  • The data we are getting about current demand has a time lag, and likely has some amount of error in the measurement. When we differentiate to measure how demand is changing, what is the error in our understanding of the rate of change?
  • How fast does demand change? If it is spiking, but the spike is short-lived, then by the time we provision a new server or two, the demand may be back down to the level it was earlier.

TODO: Insert here resources about basic control theory. Open versus closed loop; bang-bang control; proportional, PD, and PID approaches. Show how these apply to demand-based scaling.
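As a rough preview of how a closed-loop (proportional) controller could drive demand-based scaling, here is a toy sketch. The target utilization, gain, and formula are illustrative assumptions, not a recommended policy; a real auto-scaler must also cope with measurement lag, provisioning delay, and limits on how fast capacity should change.

    # Toy closed-loop (proportional) controller for auto-scaling.
    import math

    TARGET_UTILIZATION = 0.6   # desired average CPU utilization per server (assumed)
    GAIN = 1.0                 # proportional gain: how aggressively to react (assumed)

    def desired_server_count(current_servers: int, measured_utilization: float) -> int:
        # Error: how far the measured utilization is from the target.
        error = measured_utilization - TARGET_UTILIZATION
        # Proportional response: adjust capacity in proportion to the error,
        # scaled by the current capacity.
        adjustment = GAIN * error * current_servers / TARGET_UTILIZATION
        return max(1, math.ceil(current_servers + adjustment))

    # Example: 10 servers measured at 90% utilization -> suggest 15 servers.
    print(desired_server_count(10, 0.9))   # -> 15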

Further Reading

See links on queuing theory and control theory above.