I have always loved reading tech-related materials (books, articles). One of my biggest motivations to dive into writing code came from reading Automate the Boring Stuff with Python. At the time, I was working as a field engineer. I said:

"F**k it, this is too cool. I want to do this for a living." Great book!

It's been a while since I read a full book, so I asked a few programmers about their favorites. Designing Data-Intensive Applications by Martin Kleppmann kept coming up, and it was clear I needed to know what's in it.

This is not like my usual articles, where I teach something or share things in a more beginner-friendly way. This one is simply my notes from reading the book, and yes, it's indeed a great book.

When building applications, there's a lot you don't think about (or don't have to think about), and that's usually the sign of a good abstraction. But ultimately, if you're like me, that also leaves a lot of gaps to be filled. Reading books like this is one way to fill them.

An Interesting Problem

Imagine you're building some kind of game that has a leaderboard. You need this leaderboard to be sorted by score, but not just that. You also need ties broken by time, i.e., the earliest to reach a score will always rank above anyone who reaches the same score later.

I'm sure you're thinking, "Oh, every single database ever can handle this," and you'd be right. The other constraint is a 5-second window. This means that from the moment the user clicks "getLeaderboard", you need to do some auth, run your server code, query the leaderboard, and return it fully sorted. How would you go about it?

Initially, it seemed like an easy problem, but when I got to the implementation, oh boy. My eventual implementation is at the end of this article.

P.S. This has nothing much to do with this book; I just thought to add it so you have something interesting to think about while reading my notes.

Notes From Reading The Book

  • Database design is at the core of application design: your choices determine how maintainable your application is, how it scales and performs, and how well it resists both human and hardware faults. This is why the study of data systems is important.

  • Both relational and non-relational databases have their upsides and downsides. But one very positive thing is that recent databases have started to converge.

  • Relational databases can now store semi-structured data like JSON, as well as geographic data in PostgreSQL, and non-relational databases are also gaining the ability to perform joins efficiently.

  • The declarative style of querying is not only easier to read and write but often a lot more efficient, and its underlying layers can keep improving without any change to application code.
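
To make the difference concrete, here's a minimal sketch (the shark example is the classic illustration; the table and field names are made up): the imperative version spells out how to walk the data, while the declarative SQL string only states what we want and leaves the "how" to the database's query optimizer.

// Imperative: we spell out HOW to get the result, step by step.
type Animal = { name: string; family: string };

function getSharks(animals: Animal[]): Animal[] {
  const sharks: Animal[] = [];
  for (const animal of animals) {
    if (animal.family === "Sharks") sharks.push(animal);
  }
  return sharks;
}

// Declarative: we only state WHAT we want. The query planner decides how to
// execute it and can keep getting smarter without this line ever changing.
const getSharksQuery = "SELECT * FROM animals WHERE family = 'Sharks';";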

  • Graph data models help handle very large data sets with complex relationships. Relational databases don't handle such data well, because multi-hop relationships turn into chains of joins that get harder to write and slower to run as the database grows.

The use of vertices and edges makes traversing and aggregating data easier.
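
Here's a tiny vertices-and-edges sketch (the data is made up, loosely in the spirit of the book's examples). Following "within" edges transitively is the kind of variable-length traversal that turns into an awkward chain of joins in a relational schema but is natural on a graph.

type VertexId = string;
interface Edge { from: VertexId; to: VertexId; label: string }

// Toy graph: vertices are people and places, edges are labelled relationships.
const edges: Edge[] = [
  { from: "lucy", to: "alain", label: "married_to" },
  { from: "lucy", to: "idaho", label: "born_in" },
  { from: "idaho", to: "usa", label: "within" },
  { from: "usa", to: "north_america", label: "within" },
];

// "Which regions is this place inside of?" -- follow "within" edges until
// there are none left (the toy data has no cycles).
function regionsContaining(start: VertexId): VertexId[] {
  const result: VertexId[] = [];
  let frontier = [start];
  while (frontier.length > 0) {
    frontier = edges
      .filter((e) => e.label === "within" && frontier.includes(e.from))
      .map((e) => e.to);
    result.push(...frontier);
  }
  return result;
}

console.log(regionsContaining("idaho")); // ["usa", "north_america"]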

  • Log-structured storage engines (SSTables and LSM-Trees) and B-Tree storage engines are the most widely used today.

Log-structured engines greatly improve write performance since a write is, at a basic level, just an append to a file (segment). In the background, segments are merged, duplicate keys are removed, and only the most recent value is kept. Potential crashes mid-write are handled by also keeping a separate log that can be replayed on recovery.

B-Trees, on the other hand, favor read performance by keeping data sorted in a way that allows fast lookups. In LSM engines you may need to check segments one at a time to find a key, but in a B-Tree you can find it in O(log n) time. On the downside, B-Trees tend to leave unused space lying around, because pages are split in half on insertion to preserve key ordering.
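
A minimal in-memory sketch of the log-structured idea (real engines add sorted SSTables, memtables, Bloom filters, and a write-ahead log): writes only ever append to the newest segment, reads scan from newest to oldest, and compaction keeps just the latest value for each key.

type Segment = Array<[key: string, value: string]>;

const segments: Segment[] = [[]]; // oldest ... newest

function put(key: string, value: string): void {
  segments[segments.length - 1].push([key, value]); // append-only: fast writes
}

function get(key: string): string | undefined {
  // Read path: newest segment first, newest entry first.
  for (let i = segments.length - 1; i >= 0; i--) {
    for (let j = segments[i].length - 1; j >= 0; j--) {
      if (segments[i][j][0] === key) return segments[i][j][1];
    }
  }
  return undefined;
}

function compact(): void {
  // Merge all segments, keeping only the most recent value per key.
  const latest = new Map<string, string>();
  for (const segment of segments) {
    for (const [key, value] of segment) latest.set(key, value);
  }
  segments.length = 0;
  segments.push([...latest.entries()], []); // compacted segment + a fresh one
}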

  • Data warehouses are separate, read-oriented databases that greatly improve analytics. Running analytic queries directly on traditional transactional databases is problematic because such queries have to read large amounts of data. Data warehouses are mostly used by big companies, and they work by aggregating data from the various operational databases into a single database, the data warehouse. Data warehouses use column-oriented storage, since their tables can span possibly hundreds of columns.
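
A toy illustration of that difference (made-up sales data): in a column-oriented layout, an analytic aggregate only has to scan the one column it cares about instead of reading every full row.

// Row-oriented: each record's fields are stored together (good for OLTP lookups).
const rows = [
  { date: "2024-01-01", productId: 31, quantity: 3 },
  { date: "2024-01-01", productId: 69, quantity: 1 },
  { date: "2024-01-02", productId: 31, quantity: 4 },
];

// Column-oriented: all values of one column are stored together, so a query
// like SUM(quantity) only touches the quantity column.
const columns = {
  date: ["2024-01-01", "2024-01-01", "2024-01-02"],
  productId: [31, 69, 31],
  quantity: [3, 1, 4],
};

const totalQuantity = columns.quantity.reduce((sum, q) => sum + q, 0); // 8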

  • Text-based encoding (a.k.a. serialization) formats like JSON, XML, and CSV are the most commonly used due to their readability and broad language support, but they have the downsides of taking more space and handling certain data types (numbers, binary strings) poorly.

  • Binary encoding formats like Avro, Protocol Buffers, and Thrift are the most efficient in terms of storage and network bandwidth. They also have stricter schemas that can be enforced at compile time (via code generation). Their downside is that the data has to be decoded with its schema before it's human-readable, unlike textual formats.
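
A rough sketch of where the space savings come from, assuming Node.js for Buffer. The "binary" layout below is a hypothetical hand-rolled one, not the real Avro/Protocol Buffers/Thrift wire format: the point is only that when both sides share a schema, the field names and textual numbers don't need to be shipped at all.

const record = { userName: "Martin", favoriteNumber: 1337 };

// JSON repeats every field name and spells the number out as digits.
const asJson = Buffer.from(JSON.stringify(record), "utf8");
console.log(asJson.length); // 43 bytes

// Hypothetical schema-based binary layout:
// [1-byte name length]["Martin" bytes][4-byte signed integer]
const asBinary = Buffer.alloc(1 + 6 + 4);
asBinary.writeUInt8(6, 0);            // length of the name
asBinary.write("Martin", 1, "utf8");  // the name itself
asBinary.writeInt32BE(1337, 7);       // favoriteNumber
console.log(asBinary.length); // 11 bytes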

  • Database schemas are (and should be) designed to be both backwards and forwards compatible (perhaps forever) because migrations are hard. However, most databases now have ways of altering a schema without rewriting existing data, except MySQL, which typically rewrites the entire table when the schema changes.

  • RPCs are, in summary, a way to send requests between services, and they work well when both sender and receiver are within the same organization. They break down across organizational boundaries, since clients usually cannot be forced to upgrade to a newer version when the server changes.

  • Message brokers are mostly one-way and asynchronous, and they are great for sending a message to many consumers at once. Since brokers don't usually enforce a data model, messages can be encoded in whatever format the senders want, and things work even better if that format is both backwards and forwards compatible.

  • Replication helps improve the durability and availability of a database. It's the act of keeping copies of the same data on multiple machines so that if one fails, another can take over.

  • There are three types of replication: Single-Leader, Multi-Leader, and Leaderless.

    • Single-Leader is the most common: only one node, the leader, accepts writes; the change is then sent on to the other nodes (the followers), either synchronously or asynchronously.
    • Multi-Leader is similar, but more than one node accepts writes, which are then replicated to the rest. One use case is having multiple data centers (perhaps in different geographical locations).
    • Leaderless is when there's no leader at all and writes are sent to multiple replicas at once. Amazon's Dynamo is the primary example. The challenge is deciding when to report that the data has been "successfully written", which comes down to reaching a quorum (i.e., how many replicas have to acknowledge the write before reporting back to the user that the action is completed).
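
A minimal sketch of the quorum condition behind leaderless replication: with n replicas, a write waits for w acknowledgments and a read queries r replicas; requiring w + r > n means the read set and write set overlap, so every read sees at least one up-to-date copy.

// n = replicas, w = write acks required, r = replicas queried on read.
function isQuorumSafe(n: number, w: number, r: number): boolean {
  return w + r > n;
}

console.log(isQuorumSafe(3, 2, 2)); // true  -- a typical Dynamo-style setting
console.log(isQuorumSafe(3, 1, 1)); // false -- reads may miss the latest write
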
  • Partitioning is about splitting data across nodes, which reduces the load on any single node and is what makes it possible to distribute a database over many machines. However, with partitioning comes the problem of structuring the partitions so that data is distributed evenly.

  • There are a few ways of solving this: key-range partitioning and hash-based partitioning (each with its own pros and cons), with the hash-based approach more commonly used because it reduces skew.
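
A minimal sketch of hash-based partitioning: a hash of the key decides which partition a record lands on, which spreads keys far more evenly than raw key ranges. (The modulo here is only for illustration; real systems prefer schemes that don't reshuffle nearly every key when the number of partitions changes.)

import { createHash } from "node:crypto";

function partitionFor(key: string, numPartitions: number): number {
  const digest = createHash("md5").update(key).digest();
  return digest.readUInt32BE(0) % numPartitions; // stable, evenly spread
}

console.log(partitionFor("user:42", 8)); // always the same partition, 0..7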

  • Rebalancing is done periodically to correct any skew that builds up and to distribute the load evenly. It's a crucial operation that needs to be carried out with care to avoid data loss, downtime, or creating hotspots (a node with far more load than the rest) in the process.

  • Transactions are hard. There are a lot of methods for making databases perform operations atomically and in isolation, but none of them is trivial: preventing dirty reads and dirty writes, serial execution, two-phase locking, serializable snapshot isolation, and a few more, each with its own pros and cons. Your choice of database may well come down to the kind of application you're building and how a given database handles this problem.

  • Distributed systems have to be structured in a way that tolerates failure. Unlike supercomputers or single-computer programs, where behavior is largely deterministic, distributed systems are inherently non-deterministic (a lot can go wrong).

  • For starters, system clocks cannot be trusted: even with NTP, there's no guarantee the clocks are in sync, so you can't rely on timestamps to say that an action A must have occurred before an action B.

  • All communication between nodes goes over the network, and a lot can go wrong along the way: packet loss, network congestion, process pauses (like garbage collection), etc. So it's safe to assume that at every point in time in a distributed system, something is already wrong, and the system should continue to run regardless.

  • Aside from these "honest" problems, there can be "Byzantine faults", where a node deliberately "lies" to the rest of the system. Some distributed systems (for example, those spanning multiple mutually distrusting organizations) have to be designed to tolerate these faults as well.

  • Linearizability is a property of a distributed system that makes it appear as if there were only a single copy of the data, with all operations taking effect in one total order. However, linearizability is costly to guarantee, so in many cases systems settle for "causal consistency" instead. The difference is that the former says "all operations must appear in a single sequential order" while the latter says "only operations that depend on each other need to keep that order".

  • There are a lot of scenarios where you need some kind of "consensus": deciding transaction order, electing a leader node, etc. Services like ZooKeeper, etcd, and Consul are used to solve this problem so you don't have to implement the consensus algorithms from scratch, which is not an easy task.

  • Batch processing with MapReduce or dataflow engines follows an approach similar to Unix tools, where programs take input from stdin and write output to stdout, which makes them very composable for building complex pipelines (there's a tiny sketch of that shape after these batch-processing notes).

  • The difference between the two is that MapReduce writes all intermediate files to HDFS at every stage, while dataflow engines only write the final result to HDFS. Dataflow engines use a pipe-like mechanism to start later stages while earlier stages are still running, without waiting for them to complete.

  • On failure, MapReduce can simply restart the affected task or job, since its input files are still there and reprocessing them gives the same output. In dataflow engines this is not the case: after an error they have to work out where the lost data came from and recompute it from that point. This is why deterministic operators are advised, i.e., given an input, the output should be the same every time.
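
A tiny in-memory sketch of the map → group-by-key → reduce shape these engines build pipelines out of, using the canonical word count (toy input, no framework):

const lines = ["the quick brown fox", "the lazy dog", "the fox"];

// Map: emit a (word, 1) pair for every word.
const mapped = lines.flatMap((line) =>
  line.split(" ").map((word) => [word, 1] as const)
);

// Shuffle / group by key: bring all values for the same word together.
const grouped = new Map<string, number[]>();
for (const [word, one] of mapped) {
  grouped.set(word, [...(grouped.get(word) ?? []), one]);
}

// Reduce: aggregate each word's values.
const counts = [...grouped].map(
  ([word, ones]) => [word, ones.reduce((a, b) => a + b, 0)] as const
);
console.log(counts); // [["the", 3], ["quick", 1], ["brown", 1], ...]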

  • Business analytics benefits from batch processing, since analysts often need to pull and organize large amounts of data from a data warehouse. Data science and machine learning benefit too, as many machine learning systems are built by running batch jobs over large amounts of data.

  • Stream processing, unlike batch processing, works on unbounded input, i.e., data keeps arriving continuously without end. Streams are handled in two ways:

  • By message and acknowledgment: the producer sends a message to the broker, the consumer acknowledges it once processed, and the broker then deletes it. This works well when the order of the data is not important, since failures and network latency can change the ordering of events.

  • By a log-based system: the broker appends incoming messages to a log, and consumers read the log sequentially, which ensures the order is maintained.

  • Unlike in batch processing, where the inputs are immutable and a failed job can simply be rerun, a stream cannot be replayed from the beginning of time, hence different approaches are used to handle failures.

  • Microbatching, idempotence, and checkpointing are some of the techniques used to handle failures in stream processing.
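
As a small illustration of the idempotence idea (the event shape is made up): if every event carries an id and processing is a no-op the second time around, a redelivery after a crash does no harm.

interface StreamEvent { id: string; userId: string; amount: number }

const processed = new Set<string>();
const balances = new Map<string, number>();

function handle(event: StreamEvent): void {
  if (processed.has(event.id)) return; // duplicate delivery: safely ignored
  balances.set(event.userId, (balances.get(event.userId) ?? 0) + event.amount);
  processed.add(event.id);
}

// The broker redelivers the same event after a failure; it isn't double-counted.
handle({ id: "evt-1", userId: "u1", amount: 10 });
handle({ id: "evt-1", userId: "u1", amount: 10 });
console.log(balances.get("u1")); // 10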

  • Ultimately, in all data systems there's usually some trade-off in the way they work, and knowing those trade-offs gives you a much better sense of what is or isn't suitable for your project.

Conclusion

I think about a lot of applications now, and the urge to ask people at big tech companies:

"Hey, how do you handle xyz issues?" has increased, which is also a good thing. Thanks, Martin Kleppmann. If you would like to do the same, check out the book Here.

Solution I Used To Solve The Interesting Problem

First, I knew I wanted to use Redis. Redis keeps data in memory and is really fast. Next, Redis has sorted sets: you can add a user ID to the set, ranked by a score. This is where it gets interesting: fetching by score does not return ties in the order you may want, i.e., if multiple users have the same score, Redis orders their IDs lexicographically rather than by who reached the score first.

Write Flow (Saving a user):

  • GET: Request comes in
  • GET: Auth (gets user ID)
  • GET: Fetch user details from Redis (score, username, rank, etc.)
  • If not found, fetch user details from main DB
  • Create composite score with the user score and the current timestamp
  • POST: In one Redis transaction, save the user to the sorted set (leaderboard) with composite score and save the user details (update the rank and score for old users)
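
Here's a rough sketch of that write path, assuming ioredis and made-up key names ("leaderboard" for the sorted set, "user:<id>" for the details hash). I'm also assuming the timestamp is taken in seconds so it stays below 1e10; the composite score trick itself is explained further down.

import Redis from "ioredis";

const redis = new Redis(); // assumes a local Redis instance

async function saveScore(userId: string, username: string, score: number) {
  const now = Math.floor(Date.now() / 1000); // seconds, so it stays below 1e10
  const compositeScore = score * 1e10 + (1e10 - now);

  await redis
    .multi() // both writes succeed or fail together
    .zadd("leaderboard", compositeScore, userId)
    .hset(`user:${userId}`, "username", username, "score", String(score))
    .exec();
}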

Read Flow (GET Leaderboard):

  • GET: Request comes in
  • GET: Auth (user does not really need to be logged in)
  • GET: Fetch the sorted set (leaderboard); say you want the top 20, you can limit the fetch to 20 items
  • GET: Loop through the returned IDs (in the same order) and queue a fetch of each user's details (username, score, rank) in a single Redis transaction/pipeline
  • RETURN: Show the leaderboard.
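
And a matching sketch of the read path, using the same assumed client and key names as the write sketch:

async function getLeaderboard(limit = 20) {
  // Highest composite score first; ties are already broken by the timestamp part.
  const userIds = await redis.zrevrange("leaderboard", 0, limit - 1);

  // One pipelined round trip for all the detail hashes, in the same order.
  const pipeline = redis.pipeline();
  for (const userId of userIds) pipeline.hgetall(`user:${userId}`);
  const replies = (await pipeline.exec()) ?? [];

  // ioredis returns [error, result] pairs in the order the commands were queued.
  return replies.map(([, details], index) => ({
    rank: index + 1,
    userId: userIds[index],
    ...(details as Record<string, string>),
  }));
}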

There are a few improvements:

  1. If I used two requests (after auth) each time, then I wouldn't need to create a composite score. I could just use the normal scoring method and add a scoreUpdatedAt field to the user details. On the read side, I would then loop through and sort users with the same score by their scoreUpdatedAt (there's a small comparator sketch after this list).
  2. The composite score code works but it looked "patchy":
const compositeScore = score * 1e10 + (1e10 - now);

This simply creates a really large value. If two scores are equal, the tie is broken by 1e10 - now (an earlier timestamp gives a larger value), but if one score is higher, its score * 1e10 term dominates, so the higher score always gets a much higher composite score.
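
For reference, the tie-breaking described in improvement 1 is just a two-level sort once the details (including the scoreUpdatedAt field) have been fetched:

interface Entry { userId: string; score: number; scoreUpdatedAt: number }

// Highest score first; for equal scores, the earlier update wins.
function sortLeaderboard(entries: Entry[]): Entry[] {
  return [...entries].sort(
    (a, b) => b.score - a.score || a.scoreUpdatedAt - b.scoreUpdatedAt
  );
}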

I'm sure your solution may be different. As stated earlier, every mainstream database I know can handle this in some way (my question is a bit broad anyway and not tailored to all the specifics of the particular application that was being built).