Software Systems Design (FIRST DRAFT)

Version 0.1

This book is not a traditional textbook. It is a first draft of a guide to the case studies and discussion topics of a new computer science course at North Carolina State University taught by Dr. Jamie A. Jennings.

Preface

Modern software systems exhibit complex dynamic behavior resulting from the interactions between concurrently executing components, including users. Some systems are designed primarily for resilience, or performance, or scalability – or some combination of these and other properties. Sometimes the requirements appear to be at odds with each other.

Complex behavior can be observed even in seemingly simple systems. Application components have non-trivial interactions with operating systems, hardware, and networks. Users can be a particularly challenging component to design for!

Case Studies

In this course, we will examine the components and interactions of the following systems:

  • Apollo Guidance Computer: A relatively simple system built in the 1960s for the Apollo moon landing, the AGC hardware, software, and UI were designed together. The technology looks primitive in retrospect, yet the system was resilient in the face of badly behaved external systems and occasional user error. It also made extremely good use of its hardware. How were the designers, particularly the programmers, able to accomplish all of this?

  • Unix: In the 1970s, a telephone company abandoned a large government-funded multi-institution effort to build a time-sharing operating system, which promised to make better use of the expensive computing hardware of the time. The derivatives of their home-grown alternative, Unix, now dominate data centers and embedded systems, as well as Apple desktops and laptops. How did that happen?

  • Client/server systems: As networking technology developed in the 1980s, companies and universities began deploying (wired) Ethernet on their campuses. Suddenly, independent computers could talk to each other, enabling a kind of distributed application that has shaped business and consumer services ever since, including what we today call “the cloud”. What protocols and encodings enabled client/server communication?

  • The Web: In the 1990s, hypertext finally got traction and everyone wanted to be online, giving everyone else a reason to be online. The Internet, previously a tiny hamlet, was now a rapidly growing megalopolis. Seemingly overnight, all the networks grew connections to each other. How do we design products and services that scale to millions of users living all across a large country like the U.S. or across many countries like the E.U.? And how do we share a high-latency, low-bandwidth network?

  • Three-tier applications: These architectures of the 2000s were one response to the questions posed by the web. Bespoke architectures yielded to designs built on commercial software (e.g. IBM’s WebSphere application server and DB2 database) or nascent open source packages (e.g. Apache HTTP Server and MySQL). Which problems were solved by this approach, and which were not?

  • Modern globe-spanning architectures: In the 2020s, large tech companies offer products and services to practically anyone on the planet, provided they have a network connection and (usually) some cash to spare. Given not only the ambitious scale of such endeavors, but also varying government regulations and cultural expectations, how do the tech giants design their systems?

Key Questions

Cutting across all of the case studies is a set of engineering principles, including resilience, performance, scalability, and user acceptance. We will choose certain topics on which to focus for each case study. While these may vary somewhat from semester to semester, a list of key questions about software systems design should include the following.

  • What is reliability and how is it measured?
  • Is “mean time between failures (MTBF)” a good measure? How can it be calculated for a system made of interacting components?
  • What does it mean for an application to be down (unavailable)? Is it that no functions work for any user anywhere?
  • Do we approach reliability (and other properties) differently when designing systems where human life is at risk?
  • We may think of computing systems as inherently interactive, but this was not always the case. Batch systems were the norm before, say, 1975, and many computer systems today still run “batch jobs” processing astounding amounts of data (e.g. credit card transactions) quickly and reliably. How do the designs of interactive systems and batch systems differ?
  • How did Unix escape from AT&T, which provided it to universities for free?
  • Unix survives in the form of Linux, BSD, macOS, Android, and other operating systems. Why?
  • For two computers to communicate, they must speak the same protocol. What is a protocol?
  • To exchange text over a protocol, two computers also have to agree on a character encoding. For decades, that encoding was ASCII, to the detriment of competing ideas. Now we have Unicode, but that’s not just a set of encodings. What is Unicode, and why do I need to know about it?
  • Computing technology was U.S.-centric for long enough that someone coined the term internationalization to refer to the idea that other countries exist. What is i18n and how does it affect building software systems?
  • Come to think of it, what is a country? Network packets don’t know where they are, so does the internet even have a notion of international borders?
  • Is the web good for anything beyond distributing cat videos? What is the web, anyway? When we order something online, we rarely think of the planes, ships, trucks, warehouses, and delivery vans needed to bring that 3-wolf t-shirt to our door, or of the people working in all of those places. Could the same be true of the online order we placed? What shared infrastructure, regulations, resources, and funding sources made it possible for us to buy something online?
  • If I run my company using my own server, and it breaks, how can I keep my service online?
  • If I run my company by renting a cloud server, and its datacenter goes down, how can I keep my service online?
  • Suppose my company is really successful, and I end up with too much data to put in a single database system. Can I distribute my data?
  • Does it matter where I keep my data? Should it be close to my users, even if datacenters there are expensive?
  • My design needs a queue to cope with times when requests come in too fast to be served immediately. How big should the queue be? What response time will users see, and under what circumstances?
  • What happens if we get more traffic than our queue and servers can handle? Will it just keep slowing down, or will our system crash? Will we lose data? How can we recover?
  • Can we automate the scaling of our system, so that we use few resources when there is little demand, and more (rented, in the cloud) when demand is high? It takes time to spin up a new server, though. And how many more we need depends on how fast demand is rising, right? Model-based feedback control theory has some answers.
  • As a system grows in scale, it usually grows in complexity. How do we measure that complexity? How do we manage it?
  • Should the rapidly expanding use of machine learning cause us to rethink how we engineer systems? If a self-driving car kills someone, what should be done? Suppose that a bug was found – what should we do, if anything, with the output of git blame?
  • Are all software developers engineers? Some? Which? Why even ask the question?
  • It seems like everything has software in it, and that everything is capable of interacting with everything else. Where will all this progress take us, as humans? We will speculate about where ever-increasing system complexity may lead.
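To make the MTBF question above concrete, here is a back-of-the-envelope sketch in Python. It assumes independent component failures and steady-state behavior, and the MTBF/MTTR numbers are invented for illustration; real systems rarely fail independently, which is part of why the question is interesting:

```python
# Availability arithmetic for composed systems (assumes independent failures).

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*avails: float) -> float:
    """System is up only if EVERY component is up (components in series)."""
    a = 1.0
    for x in avails:
        a *= x
    return a

def parallel(*avails: float) -> float:
    """System is down only if ALL replicas are down (redundant components)."""
    down = 1.0
    for x in avails:
        down *= (1.0 - x)
    return 1.0 - down

# Hypothetical numbers: a web server and a database, each mostly reliable.
web = availability(mtbf_hours=2000, mttr_hours=1)
db = availability(mtbf_hours=5000, mttr_hours=4)

print(series(web, db))                 # web + one database
print(series(web, parallel(db, db)))   # web + a replicated database
```

Note that composing components in series always lowers availability below that of the weakest component, while replication raises it, which is one reason redundancy shows up in so many of the designs we will study.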
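The Unicode question can be illustrated in a few lines: a code point is not a byte, and the same string occupies different numbers of bytes under different encodings. A small sketch (the example string is arbitrary):

```python
# Code points vs. bytes: why "how long is this string?" has several answers.
s = "naïve café"

print(len(s))                   # number of Unicode code points: 10
print(len(s.encode("utf-8")))   # number of UTF-8 bytes: 12 (ï and é take 2 each)

# ASCII cannot represent every code point, so encoding can simply fail:
try:
    s.encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")
```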
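One classic starting point for the queue-sizing question is the M/M/1 model: a single server, Poisson arrivals, exponentially distributed service times. It is a simplification, not a prescription, but it gives first-cut answers for utilization, queue length, and response time; a sketch:

```python
# Steady-state M/M/1 queue formulas (single server, Poisson arrivals,
# exponential service times). Rates are in requests per unit time.

def mm1_metrics(arrival_rate: float, service_rate: float):
    """Return (utilization, avg. requests in system, avg. response time)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue grows without bound when arrivals >= service rate")
    rho = arrival_rate / service_rate            # utilization
    avg_in_system = rho / (1 - rho)              # L; Little's law gives L = lambda * W
    avg_response = 1 / (service_rate - arrival_rate)  # W, in the same time units
    return rho, avg_in_system, avg_response

# Hypothetical load: 80 requests/s against a server that handles 100/s.
rho, L, W = mm1_metrics(80, 100)
print(rho, L, W)   # 0.8 utilization, 4 requests in system, 0.05 s response
```

Note how nonlinear the model is: at 80% utilization the average response time (0.05 s) is already five times the bare service time (0.01 s), and it diverges as utilization approaches 100%. That divergence is the mathematical shape of the overload question in the next bullet.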
Last updated on 29 Jan 2024