CRDTs use math to enforce consistency across a distributed cluster, without having to run a consensus protocol and pay its associated latency/unavailability cost.
The set of values that a CRDT can take at any time forms a semi-lattice (specifically a join semi-lattice), which is a poset (partially-ordered set) equipped with a least-upper-bound (LUB) function.
In simple terms, a poset is a collection of items in which not all pairs are comparable. E.g., order pairs componentwise in the set {(2,4), (4,5), (2,1), (6,3)}: (2,4) < (4,5), but (2,4) can't be compared with (6,3) (since one component is larger and the other smaller). Now, a join semi-lattice is a poset in which, given two elements, even if you can't compare them, you can always find an element greater than or equal to both (the LUB, also called the join).
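As a minimal sketch of this (my own illustration, not taken from any particular CRDT library), pairs ordered componentwise form a join semi-lattice where the join is the componentwise max:

```python
def leq(a, b):
    """Componentwise partial order: a <= b iff both components are <=."""
    return a[0] <= b[0] and a[1] <= b[1]

def join(a, b):
    """Least upper bound (LUB): componentwise max."""
    return (max(a[0], b[0]), max(a[1], b[1]))

print(leq((2, 4), (4, 5)))                          # True: (2,4) <= (4,5)
print(leq((4, 5), (6, 3)) or leq((6, 3), (4, 5)))   # False: incomparable
print(join((4, 5), (6, 3)))                         # (6, 5): >= both
```

Note that `join` returns an element of the lattice even for incomparable inputs, which is exactly the property replicas rely on.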
Another condition is that updates to this datatype need to be increasing: CRDT state grows monotonically, so clients never observe a rollback.
This excellent article uses the set I used above as an example. For a CRDT maintaining those values, if two replicas have diverged to (4,5) and (6,3), they can pick the LUB = (6,5) and both adopt it. Since the values only ever increase, this is a safe value to settle on.
There are two ways for CRDTs to keep replicas in sync: they can periodically ship their full state across (state-based, or convergent, replicated data types), or they can ship individual updates across as they happen (operation-based, or commutative, replicated data types). The former takes more bandwidth.
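As a hedged sketch of the state-based flavor (my own toy example, not any real library's API), a grow-only counter gives each replica its own slot and merges full states with an elementwise max, which is exactly a lattice join:

```python
# Illustrative state-based CRDT: a grow-only counter (G-Counter).
class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1          # only ever touch our own slot

    def merge(self, other):
        # Join (LUB): elementwise max. Idempotent, commutative, associative.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)                     # ship full state both ways
print(a.value(), b.value())                # both converge to 3
```

Because merge is idempotent, shipping the same state twice is harmless, which is why state-based sync tolerates sloppy, repeated gossip at the cost of bandwidth.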
SoundCloud's Roshi is a good example (though it's no longer in development, it seems): it stores data associated with a timestamp, where the timestamp is always increasing. Any incoming update with a timestamp less than or equal to the one stored is discarded, which ensures idempotency (repeated writes are OK) and commutativity (out-of-order writes are OK; commutativity means a∘b = b∘a, which in this case means update1 followed by update2 gives the same result as update2 followed by update1).
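Roshi's actual data type is a timestamped set; as a simplified sketch of just the timestamp rule (a single-value last-writer-wins register of my own invention), discarding anything with a stale timestamp buys both properties:

```python
# Simplified sketch of the discard rule; not Roshi's real data structure.
class LWWRegister:
    def __init__(self):
        self.value, self.ts = None, 0

    def apply(self, value, ts):
        # Discard any update whose timestamp is <= the one stored.
        if ts > self.ts:
            self.value, self.ts = value, ts

r1, r2 = LWWRegister(), LWWRegister()
updates = [("a", 1), ("b", 2)]
for v, ts in updates:              # delivered in order
    r1.apply(v, ts)
for v, ts in reversed(updates):    # delivered out of order
    r2.apply(v, ts)
r1.apply("b", 2)                   # repeated write: no effect (idempotent)
print(r1.value == r2.value == "b") # True: same result either way
```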
Writes are sent to all clusters, and if certain nodes fail to respond due to an issue like slowness or a partition, they're expected to catch up later via read-repair, which ensures that the values converge. Convergence can be achieved via the two protocols I mentioned above, propagating either state or updates to other replicas; I believe Roshi does the former. As part of a read-repair, replicas exchange state, and because the data adheres to the semi-lattice property, they converge.
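A hypothetical sketch of read-repair over the earlier pair example (names and structure are mine, not Roshi's): a read collects every replica's state, computes the join, and pushes the merged result back to the stragglers:

```python
def join(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]))

replicas = {"r1": (4, 5), "r2": (6, 3), "r3": (2, 1)}  # diverged states

def read_with_repair(replicas):
    merged = (0, 0)
    for state in replicas.values():
        merged = join(merged, state)   # fold the join over all replies
    for name in replicas:              # repair: write merged state back
        replicas[name] = merged
    return merged

print(read_with_repair(replicas))       # (6, 5)
print(len(set(replicas.values())) == 1) # True: all replicas converged
```

The semi-lattice property is what makes this safe: the join is the same regardless of reply order, and repairing an already-repaired replica is a no-op.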
PS. Systems using CRDTs are eventually consistent, i.e. they pick AP (availability and partition tolerance) in the CAP theorem.
Another excellent read on the subject.