Building distributed, scalable, systems is complex – and, even more common thanks to the growing number of IoT and mobile devices. A problem I routinely see is developers attempting to treat distributed systems like monolithic systems. Specifically, developers who hold on to patterns such as ACID.
ACID, short for atomicity, consistency, isolation, & durability, represents a core pillar of relational databases (RDMSs). For many applications, the behaviors described by ACID, and enforced by RDMSs are exactly what is desired. Nobody wants such race conditions as dirty reads. However, you’re reading this because you’re presumably interested in distributed systems – in short, you’re not building the typical application, and certainly not the types of applications envisioned in the 1960s when Codd et. al. laid the groundwork for modern RDMSs.
Most of the elements of a classical database design revolve around the concept of a central authority (usually the RDMS itself) gate-keeping all reads and writes. This works well when all clients can maintain a connection to the central authority (e.g., a database connection) anytime it may desire to perform a write. Provided you can meet this requirement, and provide you can scale this singular bottleneck as client number and usage grows, then this isn’t a bad solution. In fact, it is a great solution, as the rigidity of ACID exists for a reason, and RDMSs protect against all type of poor data scenarios.
Clearly you’re not building something that is “always connected”, and therefore strict ACIDity cannot be achieved. What then?
To answer this question, let’s step back and ask why what we’re trying to accomplish and by extension what we care about. For many systems this can be articulated by the simple statement:
“I want to ensure any updates are processed correctly, and if a conflict arises that the application does the right thing.”
OK, well enough. These requirements clearly define the user’s point of view, but obviously leave a little wiggle room for interpretation – we’ll run with this and evaluate the ambiguities as they arise.
Requirement 1: “ensure any updates are processed correctly.”
Assuming that “correctly” means “when an update arrives, process it the same as if the client was always-connected” then we can accomplish this pretty easily. Simply have the offline client queue their update requests, and when they achieve an online state, simply transmit them to the server, iterating over them in chronological order, processing each one. Provided no conflicts, this is pretty simple – though, as we’ll see, there lies the rub.
Requirement 2: “if a conflict arises do the right thing”
As you’ve probably surmised, this is where things get complicated. Conflicts come in many forms, so we’ll define each one in turn, and then discuss general solutions for all these problems.
If multiple clients attempt to update the same record, what do you do? What if they update the same field, or perhaps different fields for the same record?
Many applications forgo strict database schema normalization for performance optimizations, and sometimes this takes the form of computed fields. The sum of child records, the state of an object (e.g., state-machine states), etc. What happens when conflicting updates (see above) also have an effect on what is otherwise a computed field?
Ideally you should be avoiding hard deletes, and implementing some form of logical deletion, but how do those patterns hold up with conflicts?
How do we proceed?
With the preceding breakdown of the types of conflict out of the way, let’s review how we can overcome this in a disconnected system.
Thankfully an easy solution exists due to the similarity of each of these conflict types – and, for that matter, all conflicts. All conflicts are essentially timing errors. Thing A was supposed to precede thing B, but due to life, the universe, and everything we get B before A. You know this because you’re a human. You can visualize these scenarios.
For example, a scenario involving a liquid storage provider, who utilizes a tablet-based application to record daily audits of all their storage tanks, including tracking measurements. Imagine then, if:
- Fred measures a storage take prior to the daily shipment being received, and had recorded the daily measurement at 47 liters. Fred, however, is offline during this time due to lack of proper Internet connectivity at the storage facility.
- Alice, sitting in the office, updating the fluid level of the same storage tank from the prior days reading of 50 liters to 150 liters because she just received notification from a field technician that their daily delivery was just received and the delivery driver notified Alice of the correct reading after topping off the tank.
- Later in the day, Fred finally gets connected to the Internet, and his previously recorded readings can be transmitted.
In this scenario, despite Fred’s transmission occurring after Alice’s update, our human intuition is that Fred’s update should be ignored. Perhaps kept, in case of an audit, or an over demanding boss wondering why Fred is never submitting his readings for this tank, but certainly we do not have the expectation that Fred’s readings should supersede the one’s that Alice input into the system.
This is because Fred’s updates occurred chronologically before those recorded by Alice, and therein lies our solution. Our system should preserve and respect the chronological order that updates should have occurred, not necessarily when they were received by the central server. In fact, the time that an update is received by a server is more of a technical artifact, albeit a curious one, but certainly not one of interest to average users of the system.
To rectify this, we need to add another field to each model we intend to permit offline updates on. This field should track the precise timestamp that the user indicated that they wanted an action to occur. Let’s call this the “submission date”.
Revisiting our above example, the submission date for Fred’s update might be 9am on November 1st, whereas Alice’s might have been 11:45am the same day. If we had captured those timestamps, when Fred’s record arrives, we can clearly see that he’s submitting a value that was intended to arrive much earlier. In fact, we can even see that we have a newer record, and can choose to disregard his update.
At this point you can see the importance of this additional timestamp. Our system is accruing a lot of timestamp, we’ve got updated, and created dates to help facilitate down sync, and now we’ve added a submission date to facilitate up-sync. This is simply because, during all types of synchronization, the single largest problem is that of a chronological inconsistency – timing errors. We can prevent this by maintaining good bookkeeping, and since the subject of our concern is time, then that means we’re simply keeping a lot of timestamps as bookkeeping to ensure the proper flow of data.
What about deletes?
We’ve already set the basics up for handling deletes. Using our newly added submission date timestamp, we can now easily ignore updates that happen for deletes records… if the timing is correct. You see, despite having added a new timestamp, we’ve glossed over yet another timing issue. Deletions bring them front-and-center, because deletions are actually logical deletions (you’re not hard deleting records, are you?), and as such involve an update to a column.
In fact, the conflict of a deletion occurring before/after another update is simply a special case of two updating occurring, but on different fields of the same model.
Stated another way, if we can solve the scenario of Alice updating the tank’s daily fluid reading, concurrently to an offline update from Fred attempting to update the same tank’s pressure reading, then we can also solve deletes. Let’s look at these scenarios closer and see what we’re missing and why.
- Fred visits storage site 4, and recording a daily pressure reading for storage tank B – let’s say 60 PSI. Fred is offline, however, so his updates will not be transmitted yet.
- Alice receives a call informing her of the receipt of additional liquids for tank B at site 4. She is online, and immediately updates the fluid reading to 150 liters.
- Fred finally gets internet conductivity, and transmits his pressure readings from earlier in the day.
A lot of this comes down to how we structure our data, let’s make this even more concrete. Assuming our schema looks like this:
By the time we arrive at step 3, we’d have records that look like this:
If our system, noting it’s earlier timestamp, simply ignored Fred’s submission then we’d have just thrown out data as only Fred’s record contains the pressure reading. What to do?
Entity Attribute Value to the rescue
Luckily for us a pattern exists to solve this exact problem, and it’s called the Entity Attribute Value pattern. The EAV pattern is actually pretty simple, and once integrated can greatly simplify the complexities around up-syncs.
At its core, EAV says that Entities (so the representation of a reading for a fluid tank) which contain multiple Attributes (the fluid level and pressure reading of our tank) should be represented by separate Values when recorded. In short, we should revise our schema to that one record says “fluid level is 150 liters” and another says “pressure is 60 psi”, but neither says both. Confused? Let’s look at how the above schema might look under EAV and go from there:
In this revised schema we’re using the “reading” field to serve dual purpose. Sometimes a reading is a pressure, other times it’s a volume. How do differentiate the two? That’s what the reading type field is for. In this example type 1 represents the pressure reading, and type 2 represents the fluid level. These types are entirely application specific, so don’t get hung up on that, but rather observe how in this revised schema we’ve essentially pivoted our record, and allow the preservation of both Alice’s and Fred’s update without any conflict what-so-ever.
In fact, if we use this schema and loop back in our earlier example of Fred also recording the fluid level, we get something very interesting:
Here we’ve not only used our new schema, but we also allow Fred’s earlier pressure reading to be stored into the database. Since we’re using both the EAV pattern, and have a proper submission_date timestamp, when we query for the correct/current fluid reading, we can clearly see that Fred’s fluid reading occurred earlier and ignore it. However, we still recorded it for auditing purposes. Also, and perhaps more importantly, it’s easier. Anytime we’re selective updating data we run the risk of mistakes. However, if we write everything out into what is essentially a giant audit log, we can (at query time) resolve the conflicts in real-time – each newly saved record retroactively resolves all prior conflicts.
That last bit is worth discussing a little bit more.
While our examples so far have been pretty simplistic, timing updates can get pretty complicated. Roll the dice enough, and you can have multiple updates from the same person, some online, some offline, other updates from different users, etc., all occurring in parallel. You could, on each newly received record evaluate all the existing records in the system, and determine if you should insert or update that row, but what happens if, as you’ve evaluating if Alice’s update should be recorded, another update from a different user (say Bob) arrives. If you’re running multiple web servers, like almost any modern web application, then the logic that is evaluating if Alice’s record should be saved is running concurrently to Bob’s, and both of ignorant of the other – more conflict!
Instead, the above pattern allows both Alice’s and Bob’s update to be written to the database, and when anyone else reads the data, only the record with most recent timestamp (as determined by the submission date) is used.