Skip to content

System design

Created: 2020-07-22 12:08:57 -0700 Modified: 2022-01-22 17:05:02 -0800

It sounds like the overall goal here is:

  • Know a little bit about everything, e.g. how the Internet works, what a load balancer is, etc. You don’t need to know specific technologies, so it’s not important if you’ve used Google’s cloud and you’re interviewing at Amazon.

Resources that I used to help learn

Section titled Resources that I used to help learn

High-level summary on my thoughts about system design

Section titled High-level summary on my thoughts about system design

System design covers everything from the high-level architecture like “we need a load balancer in front of horizontally scalable web servers” to lower level design like “there should be three classes in this set of code files and they have these responsibilities”.

As such, you can think of system design as being one system composed of many subsystems, where even most of the subsystems can be looked at that same way (i.e. recursively). For a more concrete example, consider the mechanism by which applications scale assuming you aren’t deploying on an existing cloud platform—you would have to have some service that keeps track of the metrics by which you’re scaling (e.g. CPU utilization, network traffic, etc.). That service would be subject to the same constraints as any other system that you’d design:

  • Which parts of CAP does it need to satisfy? E.g. does it need to be consistent?
  • Is it a single point of failure?

Note that in reality, the mechanism for scaling is typically some kind of orchestration endpoint, e.g. a Kubernetes cluster. You may want an active-active setup, an active-passive setup, etc., but that one system in your overall design (i.e. “the scaling system”) is itself an entire system to design.

If you’re interviewing for a company, it’s important that you stay at the level of detail that the interviewer expects. You could drill down into the question of “how do I make a load balancer?”, but the interviewer may want you to examine the parts of the problem that you’d be solving on the job, and it’s pretty likely that the job already has load balancers in use or can get them from a cloud platform.

That’s why it’s important to talk about trade-offs. If there were a single solution that fit everyone’s needs, then design interviews would be replaced with understanding exactly how that solution worked.

Example regarding that high-level summary

Section titled Example regarding that high-level summary

I wanted to figure out how I would implement a pubsub system like existing ones out there (e.g. Pusher or Solace’s PubSub+). I got really stuck because I kept introducing more systems like a load balancer, something to scale the system, etc. However, by looking at things that way, I wasn’t considering just the subsystem in question—the pubsub itself.

Instead, I think it’s good to focus on how the API may look, then figure out how you would scale this. For pubsub, the API when used on a single machine (e.g. between unrelated code components) is the same as the API when used across multiple machines:

  • pubsub.publish(‘event_name’, eventData)
  • pubsub.subscribe(‘event_name’, handlerFn)

That’s it. As for the transport layer, you could implement this over TCP or UDP. I assume it’s done through TCP since reliable delivery is important, but again, this is a time where you’d talk about trade-offs.

Short version (only read this to get an idea, but the medium version expands on this nicely)

Section titled Short version (only read this to get an idea, but the medium version expands on this nicely)
  • Understand the system, its users, and any other parameters or constraints
  • Do a high-level design of the subsystems involved
  • Dive deeper into the most important parts (or ask the interviewer what they care about seeing)
  • Refine any systems and handle scalability issues, single points of failure, or any bottlenecks
  1. Always start by understanding the problem, the users, any constraints, etc.
    1. Say what is out-of-scope. Typically, handling payments, signing up for accounts, etc. may be fine to hand-wave away for now.
    2. Estimate usage patterns, potentially even at the granularity where you define storage/RAM/cache/whatever needs. You may want to ask your interviewer if you should do back-of-the-envelope math, e.g. “we’ll have 10M writes per month and 2 Kb per write, so we’ll have 20 GB written per month”.
      1. A good thing to think about is how traffic is distributed; are there power users generating tons of traffic that would make for better caching?
  2. Identify the core of the problem, i.e. what are the biggest system-design challenges behind this problem?
  3. Design at a high level. Is there a LB? Is there a caching layer? How do services communicate? Which APIs will exist? What data will we store?
    1. While doing this, talk extensively about trade-offs, especially around consistency and availability.
    2. Make sure you haven’t missed any pieces.
    3. Don’t dive into details too fast; make sure we’re focused on solving the core problem. The database schema could be important, but if you’re trying to scale a service to handle a million realtime connections, then it’s probably more important to focus on that aspect.
  4. Prioritize where to dive deeper. Do this based on the problem we’re trying to solve. For example, if we’ve identified that caching is the core of the problem, then diving really deep on sharding, replication, consensus, etc. for the caching layer would be nice.
  5. Estimate capacity and throughput needs.
  6. Identify bottlenecks. Is there a single point of failure? Can your system handle the traffic you expect?
    1. The general advice is to talk about how you would benchmark and profile to find bottlenecks before addressing them in the real world, but you may have a good idea of where those bottlenecks will exist ahead of time (e.g. if you know that you’re designing Twitter and that Twitter has 100M users already).
    2. Make sure you’re not just saying “I would scale X” without addressing the why and how. Sometimes it’s just decoupling that’s needed rather than strict scalability, e.g. don’t have your web layer act as your CDN if the web layer is the bottleneck.

For me, I’ve realized that I should stick with text/typing for as long as possible since I don’t work well with diagrams.

Section titled Long version (reference, ref2 was a Google Doc sent by Google which I can’t link here)
  • Constantly think aloud just like with technical interviews
    • Breadth vs. depth
      • The interviewer will probably eventually want depth somewhere, so make sure to keep them in the loop about how you’re thinking. Keep breaking down the problem and covering as much breadth as possible while asking where they want you to go deeper (unless they want you to do that prioritization).
      • Find a good level of technical detail—not too abstract and not too detailed
  • Fully understand the problem and the hypothetical environment around the problem (AKA “gather key requirements”):
    • Users
      • Who are the users?
      • How do they interact with the system? E.g. an API, a phone app, a kiosk, etc.
      • Where do they live?
    • What’s the budget?
    • What’s the timeline?
    • How much data is it going to handle?
    • How does it make money?
  • Talk about the trade-offs extensively (EVERYTHING IS A TRADE-OFF). Did you go for an SQL database over NoSQL? Why? Did you favor performance over consistency?
  • Be prepared to justify every decision you make
    • Don’t be afraid to question your design though and change a decision, especially if you came across a new constraint or requirement
  • Google-specific tips
    • Talk about your experience with a technology if relevant
  • Prioritize! It’s better to talk about the big things than to bikeshed.
  • Consider the main APIs that you may want in your system and who would call them (reference)
  • Estimate capacity/throughput needs (reference)
    • This table may help you with specific numbers, e.g. the round-trip time from CA → Netherlands → CA is 150ms. It’s probably better to just know relative times, that way you can compare whether it’s better to compress data or to just send it uncompressed over the network.
    • Good things to remember:
      • There are roughly 2.5 million seconds in a month.
      • Hard drives take 20x the time to read compared to an SSD, and an SSD takes 4x the time that RAM does (so a hard drive is 80x the time that RAM takes).
  • Cover all tenets of software (some will be higher priority than others)
    • Security
    • Accessibility
    • Localization - system for localizing string codes, formatting strings, UI bugs like text truncation or RTL languages
    • Scalability - fault tolerance, coordination, synchronization, communication
    • Maintainability - logging, monitoring, back-ups, upgrades/rollbacks, cost to maintain (either infra, dev work, or other)
    • Geographic - CDNs, having multiple instances of infrastructure
    • Performance / latency
    • Legal
    • Administration - how do you make changes (e.g. enable/disable features), how do you support users, how do you add content
    • Functional needs

Specific techniques for checking breadth

Section titled Specific techniques for checking breadth
  • Put yourself in the shoes of a user.
    • What do you expect out of the software?
      • Be careful not to turn this into product design; the product is what the interviewer is telling you about, not necessarily asking you about. Instead, I think this helps highlight areas you may not have thought of.

It’s a good idea to separate responsibilities. Your web layer should only handle taking in web requests, passing them off to the entity that will fulfill them, then bundling the results back to the client. This is generally what’s done with lambda/serverless setups where your web layer will simply invoke a function in a managed environment that is completely separate from the web layer itself. This allows your web layer to remain as responsive as possible and also makes it easier to reason about since it only has one task.

There are a couple of ideas for this:

  • Task queue - this is a simple solution and is probably what everyone does. You would make a task queue with at-least-once delivery and only remove the item from the queue when something has handled it.
  • Consensus algorithm - you have a concept of a “leader” who could handle the one-off tasks. I don’t think this would be very common compared to a task queue, but it’s possible.


  • WHEN: these are essentially “push” notifications for updating a system without that system having to poll for updates. You typically send these to systems you don’t control (e.g. a third-party payment system may update you via webhooks) since otherwise you’d likely use a queue.
  • HOW: the system that wants to be updated sets up a server, passes a URL to the updating system, and the updating system attempts to call into it on some event (e.g. a user making a purchase).
    • The webhook is publicly accessible, so the updating system will typically sign all requests so that the receiving system can validate the signature first before processing.
    • The receiving system may be offline momentarily when the sending system emits a request. Thus, requests are typically retried every so often until an overall timeout is hit.
    • The receiving system may hang, respond with an error, etc. This means the sender may send the same request multiple times, in which case the receiving system has to be able to handle duplicate requests.

The answer to every single system-design question of “would you prefer X over Y?” is “it depends”. I know that sounds silly, but it puts you in the right mindset; trade-offs are inherent to designing systems. If it’s fast, then is it using a large amount of space? Is it consistent? Available? Does it cost a lot of money? Etc.

One basic trade-off of adding practically any system, layer, service, etc. is that it’s one more thing to manage. It’s another system that can fail and needs to be accounted for when it fails. This is worth mentioning since some team of people will have to design, create, and maintain the system, and that all takes time and skill.

When doing real-world system design, this is clearly worth considering, but for pie-in-the-sky kinds of ideas, it may be worth saying what would be the “perfect” solution if you had infinite resources.

Any distributed system can only have two of these three properties at the same time:

  • Consistency: every read receives the most recent information (or an error). Note that this is called “strong consistency” and that “eventual consistency” is not good enough to satisfy the “C” in “CAP”.
  • Availability: every request gets a valid, non-error response (even if it’s not consistent)
  • Partition tolerance: the entire system operates even when network failures partition it
    • Imagine a system of 3 machines. A network fault may make it so that A and B can communicate, but not A and C, so from C’s perspective, it may think A is offline even though it isn’t.
  • Availability vs. partition tolerance (reference): availability is concerned with individual nodes giving responses, whereas partition tolerance is concerned with the overall cluster/system giving responses.

You’re typically debating between consistency and availability since you always want your system to be available, thus you must have partition tolerance.

  • It’s possible to have an AP system (i.e. availability and partition tolerance are present) that has something like 99.999% consistency, which doesn’t violate the you-can-only-have-two-properties constraint, but it’s pretty close to satisfying all three.
    • See Cassandra, a NoSQL database (note: that link doesn’t fully explain how they did this, but apparently clients can specify which kind of consistency they want per query. For more information, read this.).

Favor consistency over availability when eventual consistency isn’t good enough or you require something like database transactions’ all-or-nothing property (e.g. interacting with purchase data probably favors consistency over availability).

Favor availability when uptime or potentially performance is more important. Remember that availability is about individual nodes, so the overall system should still be online, but without being highly available, you may not be getting a response at all from some particular nodes (or you may get an error).

Note that you can satisfy different parts of CAP at different times.

  • Weak consistency: a best-effort approach to make data consistent. Note that this definition isn’t widely agreed-upon, so if you say something has weak consistency, then you likely need to clarify exactly what you mean (it’s like saying “a load balancer is cheap” without a frame of reference for costs).
  • Eventual consistency: asynchronous consistency where reads will eventually return the newest data (e.g. DNS)
  • Strong consistency: reads will always return the newest data

For high availability (HA), you can have fail-over and/or replication.


  • Active-passive
    • Only one endpoint is handling traffic. The other is either “hot” if it has the ability to handle traffic but just doesn’t or “cold” if it has to load data upon the active endpoint failing.
  • Active-active
    • Multiple endpoints are handling traffic. If public, DNS would need all of their IP addresses.
  • As shown above, “active” just means “is actively handling traffic”.
  • When a system fails, it usually means that the availability takes a hit for some portion of time and that responses may result in errors.

Replication (see database-specific replication notes here)

  • One primary endpoint and N secondary endpoints
    • Secondary endpoints are used only for reading, not for writing. If the primary goes offline, then we have to promote one of the secondary endpoints to be the new primary.
  • Multiple primary endpoints
    • Both primary endpoints can serve reads and writes, but they collaborate on writes (i.e. they replicate writes between them). If either goes offline, the other can continue to operate.
    • In order for your application to know where to write, you’ll either need to modify their logic or add a load balancer above the primary endpoints.
    • You generally sacrifice consistency here unless you synchronously update all endpoints at once. E.g. for a database, you may have a constraint that is only violated once data flows in from the other system.
  • Read-heavy systems
    • If you expect to handle more reads than writes, then you want as many nodes as possible handling reads, and you wouldn’t care as much about writes being slower. In my mind, this just means adding more read-only nodes, replicating writes to them, and then balancing traffic across them (i.e. load balancing), so I don’t think there’s some magical technique or structure for optimizing for reads.
  • Primary vs. active: note that “active” does not mean the same thing as “primary”. “Active” means that the endpoint is handling traffic. In a primary-secondary system, both nodes are active, but only the primary one is handling writes.
  • Impact on consistency
    • Replication strongly affects consistency. If, for example, you write to a primary endpoint and then try reading from a secondary endpoint immediately, then you will only have strong consistency if you have synchronous writes. With asynchronous writes, you could only have eventual consistency.
    • The way this is typically done across multiple endpoints is to have secondary endpoints acknowledge the completion of some operation from the primary. If not acknowledged, then the primary will communicate the change to the secondary again (note that I’m not trying to make a statement on exactly how this works, e.g. the secondary may request it from the primary, or the primary may timeout when reaching the secondary, etc.).
      • For information on how exactly Redis does replication, check this out. They mention that while the primary and secondary are both online, they can just stream commands from one to the other. If there’s a disconnect, then they do a partial resync if possible, and if not, they do a full resync.
        • Redis can accomplish this replication asynchronously by default since it aims for high availability and performance over strong consistency.
  • In short: DNS converts domain names like “” into an IP address that your computer will ask your router (which then typically turns into a series of routers) to connect to.
  • DNS is hierarchical. There are few authoritative servers at the top level.
  • Each DNS server manages its own cache of records. Records can become stale based on their TTL.
  • Using DNS to distribute traffic
    • DNS supports listing multiple IP addresses for a particular domain name. Most clients will always take the first one. This allows services to permute the list of IP addresses to get round-robin style distribution.
    • Beyond that, the DNS server itself is simply taking a domain name as its input and returning an IP address as its output, so there could be logic in between to return a specific IP address based on latency or geolocation.
      • Note that if you do anything “special” at the DNS layer, then your server would have to be the one to handle the DNS request (meaning that if another server is hit or if a cache is used, your special logic wouldn’t be run). Thus, if you NEED this special logic, then you’re probably coding your system to always hit your “DNS” server (which I don’t really think of as true DNS since other DNS servers may not have the same logic).
      • Since you can use DNS for distributing traffic, it becomes a consideration when comparing against other solutions to load balancing or geolocation.
      • For example, a hardware load balancer would have already received all of the traffic to it (i.e. bandwidth) in order for it to route anywhere else, whereas DNS is queried before even sending the business traffic to the server.
  • DNS can also be done over HTTPS (or TLS), e.g. if you want to make sure that even the lookup of a domain name is encrypted from others on the network.
  • In short: CDNs are proxies that are typically closer to a particular consumer, so content can be downloaded faster once cached.
    • Also, because a proxy is handling the request most of the time, it doesn’t need to contact your servers at all, freeing their resources up to do other tasks.
  • A client knows which endpoint to connect to in the CDN by virtue of DNS servers returning the closest IP address to them. Note that this is a slight simplification; it could also be based on load, legal compliance, etc.
  • A CDN may be behind a load balancer, but it typically wouldn’t be behind the same load balancer that your web layer is behind.
  • Push vs. pull
    • Push CDNs: you upload content directly to the CDN when there’s a change or update. The CDN itself wouldn’t necessarily have a time-based cache at that point—it would just have your files.
    • Pull CDNs: you upload nothing to the CDN, and the CDN will figure out whether it needs to contact your server based on its current cache.
      • The content being stored could be stale if you update it before the TTL expires. For example, you upload cats.png to your CDN with a TTL of 10 minutes, but then immediately reupload it with better cropping. A pull CDN wouldn’t know to pull it again unless you invalidate its cache prematurely.
  • Because you’re serving content through a CDN, the URL will change from its source URL, which means you’ll have to update clients if you introduce a CDN.
  • In short: load balancers help distribute traffic to servers behind them.
  • Load balancers themselves have to be high-availability systems, meaning they may have an active-passive setup or an active-active setup (with replication of any state needed).
  • Hardware vs. software
    • Hardware load balancers can be expensive (like $20K+ expensive)
    • Software load balancer examples: HAProxy and nginx.
  • Traffic distribution
    • Traffic can be distributed to workers via any number of methods: round robin, random, based on CPU usage, etc.
      • Layer-4 distribution: this just looks at headers of packets, e.g. source and destination IP addresses.
      • Layer-7 distribution: this is based on the contents of the packets themselves, meaning they can inspect cookies, HTTP headers, etc. For example, they could detect that you’re accessing the “/users” URL and point you to a different server.
        • This takes more time to do since not only is there more to inspect, but also, you have to decrypt any encrypted traffic (see SSL termination below).
  • SSL termination
    • In order for load balancers to be able to examine traffic, they need to decrypt it first. This means they need a certificate to do so, which means you have to trust your load balancer.
    • By decrypting the traffic in the load balancer, your workers don’t have to do it, which can save them time (and also means you don’t have to give them a certificate).
    • The load balancer can also encrypt the workers’ responses with an X.509 certificate.
  • In short: this is a proxy intended for use by your servers rather than for use by the clients. The clients may not even know that such a proxy is being used.
    • By this definition, load balancers frequently act as reverse proxies since they’re the point of contact for the client.
  • Some desirable properties:
    • You could hide information about your specific back-end by only exposing reverse proxies to clients.
    • …and pretty much everything that a load balancer does since a load balancer can be a reverse proxy.
      • This is where some definitions start to get fuzzy; a reverse proxy may perform some firewall functionality, some DNS features, etc.
  • Unlike a load balancer, a reverse proxy could make sense in front of just a single server.
    • You may want to obscure information about your back-end
    • You may want to protect against vulnerabilities (it may be easier to secure the reverse proxy than the server itself)
    • You may want to modify headers/traffic before forwarding requests
    • You may want to cache or compress information
  • ACID (reference) - used to describe database transactions:
    • Atomicity - each transaction is all or nothing
    • Consistency - ensures that each transaction moves the database from one valid state to another, e.g. without violating foreign keys, uniqueness constraints, triggers, etc.
    • Isolation - executing transactions concurrently has the same results as if you executed them sequentially
    • Durability - once data is written, it’s guaranteed to exist even after a system failure (i.e. it’s written to non-volatile memory)
  • Write-ahead logging (WAL) (reference)
    • This is a technique where data is first written to a log. The changes must be committed to local storage before being written to the database, that way you achieve durability.
    • This is typically how databases achieve eventual consistency is by keeping track of where in the log each other node in the system has acknowledged.
  • Reads vs. writes
    • In most systems, reads greatly outnumber writes (e.g. 100:1 or 1000:1) (reference).
  • Scaling
    • Replication
    • Sharding (reference)
      • This is partitioning your data such that each shard only manages a subset of the data. You typically shard by some property of the data, e.g. last name, ID, location, load incurred on the system, etc.
      • Advantages
        • You to write in parallel to each shard.
        • By partitioning data:
          • Searches are typically faster
          • You have more cache hits
      • Disadvantages:
        • Figuring out which shard to contact
          • ”Database router” vs. storing in the application layer
            • Your application will somehow need to know which shard to reach out to for specific data. This could involve adding a “database router” that knows which shard houses which data. If such a “router” is added, then another hop will exist between the application layer and the database layer (decreasing performance). If there isn’t such a “router”, then the application layer itself will have to know whenever shards change to handle different data.
            • Note that this “router” could just a be a “primary” shard.
            • You could also have a deterministic routing function (e.g. all user’s with a last name starting with “A” go here, “B” goes there), but then you wouldn’t be able to dynamically partition data so that, for example, high-traffic users go to their own shard.
            • I feel like having a “router” makes the most sense in practice (in most cases) so that the application layer doesn’t need to be modified to know where to route queries.
        • Shards can become imbalanced (e.g. if 1% of the users generate 90% of the traffic). You can rebalance shards in that case.
          • In that case, you typically go through some process like writing to both the old and new shards at once, then slowly phasing out the old shard. Alternatively, you could just split the shard into two immediately by replicating any data needed at the point of creation.
        • Joining data can be more difficult.
    • Federation (AKA “functional partitioning”)
      • This is just sharding by function, e.g. you split your database across endpoints such that there’s one for users, one for forums, and one for products. As such, it has all of the same pros/cons as sharding.
    • Denormalization (reference)
      • This is a way of writing data in multiple ways or locations such that reading becomes less computationally expensive (e.g. avoiding the need for a join). The writes are thus redundant since they don’t need to be stored more than once.
      • Disadvantages:
        • If you write more than you read, this is not going to save you any time.
        • You have multiple sources of truth, which means they need to stay in-sync. Materialized views help with this; they’re essentially virtual tables that you can query that know their dependencies and thus can figure out when they’re stale.
          • Unlike regular views (like in MySQL), materialized views are stored on-disk.
    • SQL tuning (reference)
      • This is general section on how to improve performance of SQL queries, e.g.
        • Use better indices
        • Use the right primitive types (e.g. CHAR vs. VARCHAR vs. TEXT)
        • Denormalize when you have expensive joins
  • Replication consistency
    • As mentioned earlier in these notes, for strong consistency, synchronous writes are needed. For eventual consistency, the replication target will have to acknowledge the replicator’s change. If not, some record will have to be kept that the replication wasn’t successful so that it can be repeated.
  • Full-text search (e.g. Elasticsearch)
    • In short: if you want to text searches over a database (e.g. a Craigslist-like search for “bikes in New York”), then you would need a specialized database.
      • ”select * from listings where product_name like ‘%bikes%’;” won’t cut it for complex queries, mistyped queries, etc.
    • As for how the specialized database works, it’s not too important. It’s a document-oriented database that can return fuzzy matches with an associated relevance score.
    • There are trade-offs here. For example, the indexing typically lags behind the source database by some number of seconds. Thus, you wouldn’t treat your text-search database as the source of truth, so you would usually use it in conjunction with a traditional SQL database.
      • This pipeline of extracting data from one database, transforming it so that another system understands it, and then loading it into that other system is called ETL.
      • For Elasticsearch, it sounds like you would replicate data to it based on your system’s needs. For example, if you want new product listings to show up soon (~15 minutes), you would need to stream data to it. However, maybe you get a dataset from another company and only want to replicate that once per day.
        • You could use a queue for handling this if you needed, that way Elasticsearch could fall behind a bit and not lose data.
  • Data warehousing
    • In short: data warehouses are used for reporting and analysis.
      • They’re typically integrated data, meaning they have data from many sources, and they’re usually separate from the production database powering user-triggered queries (i.e. you can do all of the analytics on this database you want without slowing down your production systems).
      • These usually come with a SQL-like query language.
      • They’re usually good at querying very large datasets since their purpose is typically to power business-intelligence systems.
    • Replication
      • You would replicate to your data warehouse every so often (e.g. once per day).
    • Hive specifics (although these probably apply to other data warehouses as well)
      • Supports all four ACID properties of transactions
      • Schema-on-read: this is how Hive achieves fast loading times (where “load” is the “L” in “ETL”, meaning data is being stored in the database). Typical databases use schema-on-write, which means that data is validated against the schema at the time that it’s entered into the database. Schema-on-read means that it’s validated at the time the data is read. This means that query performance is a bit slower, but loading the data in will be faster.
        • This means that data can be loaded into Hive that breaks the schema, but you won’t get an error until you try reading it. At that point, you’ll get something like the “HIVE_BAD_DATA” error (reference, reference2). That that point, you could fix it by changing the schema.
    • Sample technologies that are data warehouses: Hive, Impala, Presto, Redshift, AWS Athena (which is at least somewhat compatible with Hive)
  • Indexing
    • Databases are almost always concerned with performance, and the way they achieve good indexing performance is typically with b-trees (balanced n-ary trees), but write-heavy databases may use log-structured merge-trees (essentially multi-level hashmaps).
  • SQL vs. NoSQL
    • NoSQL
      • NoSQL usually favors eventual consistency over availability. The acronym here would be “BASE” (reference):
        • Basically Available: the system is available most of the time
        • Soft state: the state of the system may change over time, even without input
        • Eventual consistency
      • Types of databases
        • Key-value store: this is like a large hashmap essentially. It has constant-time retrieval and storage. Operations are usually simple, so complexity is usually shifted to the application layer.
        • Document store (sometimes very closely related to key-value stores to the point where differentiation is difficult): you store blobs of data (e.g. JSON blobs) and can query the internal structure of a document.
          • Some databases try to emulate some SQL features like joins, but they usually have different limitations from regular SQL databases.
          • Some databases (like Firestore) put indexes on all of the fields of a document to make for quick lookups.
        • Wide column store (e.g. Google’s BigTable or Amazon’s DynamoDB): hierarchical grouping of key/value pairs into columns, column families, and then super columns. The columns and column families also have a row ID associated to them for lookup.
        • Graph database: essentially a graph persisted to disk; you have nodes and edges to represent relationships between data. This is especially helpful for heavily interconnected data (many-to-many relationships). Social networks may use these to represent social graphs.
      • Memory caches like Redis are a form of NoSQL database (usually a key-value store or a document store).
  • Distributed transactions
    • In general, a distributed transaction needs to maintain the same all-or-nothing behavior that a regular transaction would guarantee. In order to do this across systems, it means that each local transaction must be held until the final transaction is complete, otherwise a local transaction may be committed despite a dependent transaction failing.
      • [09:35] Bremir13: Sagas, 2PC, etc. all do the same thing conceptually - break a sequence of operations into self-describing states, which can be deterministically continued or rolled back.
    • SAGA is a pattern that has each process doing its own local transactions and is typically implemented either with events (where a system says “I created the payment record, so now someone has to deliver the product”) or with an orchestrator, where the orchestrator knows which database handles each transaction.
      • Like with distributed transactions in general, each subsystem in this flow needs to be a transaction that gets rolled back if any subsequent subsystem fails.
  • Optimistic vs. pessimistic locking (reference)
    • In short: optimistic locking is when you check to make sure the version of your data hasn’t changed when you go to write it. If it has, it means that something else changed it first.
      • Pessimistic locking is obtaining the exclusive lock to your table or records before you do anything.
    • Optimistic locking is apparently good for high-traffic systems since locking can be expensive.
  • Back-ups
    • Back-ups are different from replication. Back-ups are performed at specific intervals (e.g. once per day), and they guarantee that the only data that can be lost is data added/changed/deleted since the last back-up.
      • Back-ups are typically in a completely different datacenter from the live servers, that way there isn’t a single point of failure in terms of a power outage or fire in the physical datacenter.
        • Note that back-ups, at their core, don’t have to be in different datacenters and that some people classify that aspect as being “disaster recovery”. Typically you would replicate from your back-up somewhere else since there can be quite a lot of data.
      • In a live system, deletions/edits will also be replicated and thus will not be undoable.
  • In short: store data to prevent a more expensive calculation later or to shift processing from one node to another.
    • Simple example: if you find that you’re reading from the same database shard disproportionately compared to other shards, then you may want to cache that data so that the database doesn’t even need to be contacted.
  • Qualities
    • You need to determine the level of consistency you want between the cache and the source of truth (i.e. the database). If you’re okay with eventual consistency, then just having a TTL on cache items could be fine. If you want strong consistency, the database would have to update the cache (which I imagine would be done rarely and only for specific pieces of information if it were done at all).
  • Locations
    • Caches can be basically anywhere: CDNs, reverse proxies, the client itself, databases, in-memory caches like Redis or memcached for the application layer.
  • When to update the cache (these do not have to be mutually exclusive)
    • Cache-aside (reference)
      • Look for entry in cache
        • If not exists, load from database
        • Store in cache
      • Return entry
    • Write-through (reference)
      • Application only interacts with the cache (including for writing data)
      • The cache interacts with the database synchronously so that the cache is just a reflection of the database
    • Write-behind or write-back (reference)
      • This is just an asynchronous write-through strategy
    • Refresh-ahead (reference)
      • The cache itself predicts what should be refreshed as a function of access time/frequency and expiration time, then it refreshes that data automatically.
  • ETags (MDN Reference, Wikipedia reference)
    • ETags, or entity tags, are some way of identifying a specific resource and version (typically done by hashing the resource’s contents, hashing the last-modified timestamp, or just using a revision number (reference)). They can be used to compare whether a resource has changed, which is useful for caching, preventing overwrites (see “mid-air collisions”), and even updating resources at the database level.
      • Because ETags are designed to identify the resource at a particular version, a new ETag should only be assigned if the resource changes.
  • Ejecting information from the cache
    • Caches only hold so much data, so you need to eject information from the cache at some point. You may do this based on which elements were least frequently used, least recently used, first-in-first-out, etc.

I’m not actually sure that this will ever be part of a system-design question, but I’m writing this here anyway since I took the time to learn about it.

  • Protocol buffers (reference) (AKA “protobufs”)
    • Note: I’m not positive whether the term “protocol buffers” is generic (like “tissues”) or implementation-specific (like “Kleenex”). I’ll describe the overall idea here:
    • You have an interface descriptor on two endpoints. The descriptor looks something like this (as pseudo code):

X is an integer

Y is a string

  • Since both endpoints have the descriptor, it doesn’t need to be transferred each time you want a message to be serialized according to the descriptor. Instead, you can pack everything tightly into a binary stream of data and send it over the wire.

  • gRPC uses protocol buffers for remote procedure calls.

  • Other technologies: Apache Avro, Apache Parquet

  • Message queues
    • At its core, this is just to receive, hold, and deliver messages.
    • Queues are typically used for event-based systems, e.g. a user signs up, so you send them an email, or someone makes a purchase, so you produce a PDF receipt.
    • The consumer of a message will typically signal to the queue that it was handled, that way it can be removed from the queue. Messages that aren’t handled could potentially be added to a dead-letter queue.
  • Pubsub (AKA “message brokers”)
    • Pubsubs are like message queues in that they deliver messages to a “topic”, and any number of listeners can subscribe to that topic. Just knowing that something is a pubsub doesn’t say anything about its deliverability. I.e. it may fail to deliver even once, or it may have at-least-once delivery.
  • Task queue
    • This is built on message queues for a producer/consumer way of thinking of workflows:
      • An application publishes a job to the queue
      • A worker takes a job off of the queue, processes it, and signals that the job is complete. Note that this step doesn’t have to involve telling the publisher that it’s complete (although doing so could be part of the job itself).
  • Deliverability
    • Some queues/pubsubs could potentially drop messages or deliver them multiple times. This is the deliverability. If a system can deliver messages multiple times, then the receivers need to make sure that subsequent calls are idempotent, meaning they don’t cause additional side effects (i.e. only the first call will likely have side effects since there was some reason to listen to the message in the first place).
  • Back pressure
    • When a queue fills, back pressure is the concept of erroring on attempting to add new data so that the client is responsible for trying again later.
  • OSI model (reference)
    • Layer 7: application - this is for high-level APIs and business logic
    • Layer 6: presentation - translation of data between a networking service and an application, e.g. character encoding, compression/decompression, encryption/decryption.
    • Layer 5: session - managing communication sessions (i.e. multiple transport-layer packets being interpreted)
    • Layer 4: transport - reliable transmission of data segments. This includes segmentation, acknowledgment, and multiplexing.
    • Layer 3: network - the packet level. This involve structuring and managing a multi-node network, including addressing and routing.
    • Layer 2: data link - reliable transmission of data frames between two nodes connected by a physical layer.
    • Layer 1: physical - transmission and reception of raw bit streams over a physical medium.
  • If you want a mnemonic for PDNTSPA, you can use “Please Do Not Throw Salami Pizza Away”.
  • TCP
    • Reliable
      • Thanks to checksums and sequence numbers, packets that arrive will be processed in the order they were sent, and packets that are corrupted will be re-sent.
    • There’s a handshake over a socket and acknowledgments of packets.
    • There is also congestion control, e.g. for combining several small packets into one larger packet (see Nagle’s algorithm).
    • Due to these constraints for reliability and congestion control, TCP generally takes longer for packets to arrive and be usable than UDP.
  • UDP
    • Connectionless - datagrams are sent to the destination. They may arrive, they may not. They may be out of order.
    • Because this isn’t reliable, you would use this for something like video transmission, VoIP, etc. where losing some datagrams isn’t the end of the world.
  • This is when you call a function in another address space (typically another machine entirely). The details of the call are abstracted so that it looks essentially like you’re calling a function locally.
  • REST vs. RPC: RPC is just calling a function on another machine, so REST calls are RPCs. However, REST exposes data through resources and uses HTTP as the protocol.
    • [14:34] M3talstorm: @Adam13531 noooo REST is not like RPC, REST you say “I want my data to look like this” (and the server works out what to do to make that happen) RPC is “Do this action and here are the parameters” (just like a function
    • [14:39] M3talstorm: @Adam13531 it’s like you see so many ‘REST’ APIs that do something like /user/:id /disable rather than doing a PUT/PATCH request to /user/:id that just contains { ‘disabled’ : True }
  • Encrypt in transit and at rest
  • Sanitize inputs
  • Prevent against common vulnerabilities, e.g. CSRF, XSS, IDOR, SQL injections.
  • Use the principle of least privilege.

Most resources on system design don’t seem to delve into routers, so I don’t think it’s too important. Here are some very high-level notes:

  • Routers have a routing table, which is a list of known addresses for forwarding packets to their destinations. If a router doesn’t know how to route traffic, it has a way of communicating with other routers in the network so that it can still deliver traffic.
  • Anycast is a methodology by which one IP address can route to any number of destination endpoints. This is how Google’s DNS ( works; it’s one IP address, but they have hundreds of endpoints behind it. It’s also how CDNs could work, although DNS itself could also handle that, e.g. needs to resolved to an IP address, and the DNS server could be what’s converting it to an IP closest to the source traffic (as opposed to anycast which would happen at the router layer).

Like routers, firewalls didn’t really come up a lot in system-design prep materials, but I think it’s still worth mentioning.

  • In short: firewalls are a set of rules to allow or deny traffic based on some criteria.
  • They are implemented at layers 3 (network), 4 (transport), and 7 (application). Depending on which layer they’re at, the criteria on which the rules are based differ. For example, at layer 7, you can actually examine the data in the request itself, e.g. to rate-limit users who aren’t logged in.

Consensus (AKA “distributed consensus”)

Section titled Consensus (AKA “distributed consensus”)

Consensus (more specifically distributed consensus since consensus in a non-distributed system is less complex) is getting multiple nodes to agree on some data or fact. Nodes within a cluster may not have the right data, but there’s some way of getting the right data.

In general, this is used for data like the following:

  • Service discovery - you know that some endpoint exists for handling machine learning, but you don’t know where it is. Note that Consul, service-discovery software by HashiCorp, uses Raft under the hood (reference).
    • Shard locations - you want to know which shard is responsible for a particular piece of data.
  • Leadership elections - you want to know which endpoint is a leader.

Consensus vs. consistency

It’s not that consistency never comes up, it’s just that the cluster as a whole is what is consistent, not individual nodes. If you find yourself wondering about consistency, then you may be trying to use a consensus solution as a database solution, which you shouldn’t do for performance reasons:

  • In a consensus approach, most of the nodes need to agree on some value, not all of them. In a database with anything other than weak consistency, you at least eventually want all nodes to agree on the same values.
  • If you approached a database with a consensus algorithm, then in order to maintain consistency, you’d need an extra step. For example, in Raft, you’d always have to go through the leader since it’s the only one guaranteed to provide a strongly consistent value. By doing so, all traffic would be filtered through a single node which wouldn’t benefit from the use of other nodes except for redundancy/failure cases.

Common implementations/algorithms

  • Raft (introductory visualization here, visualization sandbox here) - a minimum of 3 instances are needed, but they could all be on the same machine if you wanted. Raft systems usually have an odd number of nodes since getting the majority to agree on something is important, and there is no guaranteed majority among an even number of actors (although the more nodes you have, the more could fail).
    • Nodes could have inconsistent information, although the entire cluster will have a majority of the nodes with a consistent answer. Thus, if you want a consistent answer, you’d go through the leader (by making the query a log entry (reference)).
  • Paxos - a family of protocols
  • Zookeeper Atomic Broadcast (ZAB) - similar to Paxos
  • Technically a blockchain is a form of consensus, but it typically involves many nodes you don’t actually trust

(these notes follow this video)


This is from 2013, but the concepts are still up-to-date in 2020.

I don’t think this is worth watching if you went through one of the massive text links like this GitHub book since they both cover the same data, but I think the text makes it a little bit easier to follow (e.g. you don’t have to listen to some un-mic’d student give an answer).


  • 0m - ~7m - waste of time to watch; they cover web hosts and VPS providers and what you’d want to look for.
  • ~7m - ~14m - the only important thing is the concept of vertical scaling, which is increasing the resources for a particular endpoint (e.g. more RAM, more cores, more disk space, etc.)
  • ~14m - ~25m
    • Horizontal scaling - scaling out with more machines rather than stronger machines. By adding more machines, we need to balance traffic/workloads to the back-end instances.
      • The load balancer itself gets a public IP address, meaning the back-end instances don’t even have to be public.
      • The LB also needs to figure out how it’s going to distribute traffic (e.g. based on CPU usage, round robin, random, etc.).
      • At its simplest, the load balancer could be considered a DNS server that just returns different IP addresses.
  • 25m - 30m
    • Caching - the client will cache the IP address returned from DNS based on its TTL (as will pretty much all nodes between the client and an authoritative DNS server).
  • 30m - 35m
    • When balancing load to multiple servers, they either have to be stateless or you need to balance to the same one each time in order to make use of state (i.e. sticky sessions). Making them stateless is generally what’s done, and any state needed will be in an in-memory cache or the database.
  • 35m - 44m - he talks about RAID as an analogy for data redundancy in a distributed system.
  • 44m - ~47m
    • This data redundancy in a distributed system will typically be a database as the source of truth, where the database has its own replication logic and methodology for scaling.
  • 47m - 51m - LB stickiness
    • The LB can insert a cookie with a random number to identify which back-end instance to stick a client to in a sticky session. This is just one such technique for stickiness.
  • 51m - 1h8m - caching
    • You can cache the output of any static content (“compiled” PHP or whatever you call it, HTML files, etc.), cache queries in MySQL, etc.
    • He talks about memcached to save the results from the database.
    • The cache will eventually get too big to fit on its machine, so we expire objects based on some metric (e.g. LRU, LFU, FIFO, etc.).
  • 1h11m - 1h20m - replication
    • All replication is to prevent a SPoF (if one copy fails, there’s another copy as a back-up). In can also help distribute traffic.
    • Primary-secondary: the primary handles writes (and potentially reads), and all secondaries handle only reads. If the primary goes offline, we have to promote a secondary.
    • Primary-primary (and potentially also have secondaries): writes can go to either primary, and they’d get replicated between them, but reads usually go to the secondaries. If one primary goes offline, you still have another primary.
    • Either way, if there’s any sort of replication, then your application servers need to know which database instance to contact, which means, at the very least, a registry of IP addresses, and at most a load balancer (which could be the same as the one in front of the application layer since you can route based on the source and destination).
  • 1h20m - 1h22m
    • Having just a single load balancer means you have a single point of failure, so typically, the LB itself will have redundancy in an active-active or active-passive system.
  • 1h22m - 1h24m
    • Partitioning data by some dimension can be helpful, but it’s hard to join or transact across those partitions.
  • 1h25m - 1h40m
    • By designing your infrastructure correctly, your code doesn’t have to be aware of the infrastructure itself. E.g. if you have a load balancer in front of your database (or if your database software handles replication/sharding for you), then you can just connect to the LB’s IP address and route your queries through that. The DB itself can then figure it out.
    • Always look for SPoFs, even if it’s a hardware switch (which, in that case, you’d just have two hardware switches and connect cables from two Ethernet ports on each computer).
    • Your load balancing could be at the DNS level to a whole availability zone or geographic location for a datacenter. You would definitely want stickiness at that point since the memory caches likely wouldn’t be shared across datacenters.
  • 1:40m - end - security
    • A firewall would only allow TCP (ports 80, 443) traffic in, maybe SSH.
    • SSL termination at the load balancer would mean that the back-end only needs TCP (port 80). The database is typically TCP traffic on port 3306.
    • This concept of tightening everything is the theory of least privilege since people will probe for weaknesses.