In an ideal collision-resistant system, when a new file / object is ingested, the flow looks like this (a minimal code sketch follows the list):
- A hash of the incoming item is computed.
- If the incoming hash does not already exist in the store:
  - The item's data is saved and associated with the hash as its identifier.
- If the incoming hash matches an existing hash in the store:
  - The existing data is retrieved.
  - A bit-by-bit comparison of the existing data against the new data is performed.
  - If the two copies are identical, the new entry is linked to the existing hash.
  - If the copies are not identical, the new data is either:
    - rejected, or
    - appended or prefixed* with additional data (e.g. a timestamp or user ID) and re-hashed; the entire process is then repeated.
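Here's a minimal sketch of that flow in Python, just to make the steps concrete. The function name `ingest`, the in-memory dict standing in for the object store, SHA-256, and the timestamp prefix are all illustrative assumptions, not any particular system's implementation:

```python
import hashlib
import time

def ingest(store: dict[str, bytes], data: bytes, prefix: bytes = b"") -> str:
    """Collision-checked ingestion; `store` maps hex digests to raw payloads."""
    payload = prefix + data
    digest = hashlib.sha256(payload).hexdigest()
    existing = store.get(digest)
    if existing is None:
        # Hash not seen before: save the payload under its digest.
        store[digest] = payload
        return digest
    if existing == payload:
        # Byte-for-byte identical: link the new entry to the existing hash.
        return digest
    # Same digest, different bytes: a genuine collision. Prefix the data with
    # extra bytes (a timestamp here) and repeat the whole process; rejecting
    # the write outright is the other option.
    return ingest(store, data, prefix + str(time.time_ns()).encode())

store: dict[str, bytes] = {}
print(ingest(store, b"hello world"))  # first write: digest of the raw bytes
print(ingest(store, b"hello world"))  # identical bytes: same digest, data reused
```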
So no, it's not inevitable that information is lost in a content-addressable storage system.
* Ideally, the existing stored data would then be re-hashed in the same way, and the original hash entry tagged (e.g. linked to a zero-byte payload) to indicate that multiple stored objects originally resolved to that hash (similar in concept to a 'Disambiguation page' on Wikipedia). Whether that is necessary depends on how data needs to be retrieved from the system.
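If that re-keying step is wanted, it might look something like the sketch below (again hypothetical: `disambiguate`, the prefix argument, and the zero-byte marker convention are assumptions layered on the dict-based store above; a real system would also have to record which prefixed digests hang off the marker so lookups can still be resolved):

```python
import hashlib

def disambiguate(store: dict[str, bytes], digest: str, prefix: bytes) -> str:
    """Move the payload originally stored at `digest` to a prefixed hash and
    leave a zero-byte marker behind, flagging the old digest as a former
    collision point (a 'disambiguation' entry)."""
    original = store[digest]
    rekeyed = hashlib.sha256(prefix + original).hexdigest()
    store[rekeyed] = prefix + original
    store[digest] = b""  # zero-byte payload: multiple objects once resolved here
    return rekeyed
```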
While intentionally causing a collision may be astronomically impractical for a given algorithm, a random collision is possible as soon as the second storage transaction.
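The usual birthday-bound approximation puts numbers on that: for n uniformly random b-bit digests, the chance of at least one collision is roughly 1 - exp(-n(n-1) / 2^(b+1)), which is tiny but never zero. A quick way to get a feel for it, assuming an ideal 256-bit hash:

```python
import math

def collision_probability(num_objects: int, hash_bits: int = 256) -> float:
    """Birthday-bound estimate of the chance that at least two of
    `num_objects` uniformly random `hash_bits`-bit digests collide."""
    n = num_objects
    # -expm1(x) computes 1 - e**x without losing precision for tiny x.
    return -math.expm1(-n * (n - 1) / (2.0 * 2.0 ** hash_bits))

print(collision_probability(2))       # about 8.6e-78: possible, never zero
print(collision_probability(10**12))  # about 4.3e-54 even at a trillion objects
```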
Note: Some small / non-critical systems skip the binary comparison step, trading risk for bandwidth or processing time. (Usually, this is only done if certain metadata matches, such as filename or data length.)
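In code, that shortcut might look roughly like this (hypothetical `ingest_fast`, storing a (payload, name) tuple per digest; the fall-back to a full comparison when the metadata differs is one reasonable choice, not how any particular system actually behaves):

```python
import hashlib

def ingest_fast(store: dict[str, tuple[bytes, str]], data: bytes, name: str) -> str:
    """Shortcut variant: trust a digest match when the stored name and length
    also match, skipping the byte-for-byte comparison entirely."""
    digest = hashlib.sha256(data).hexdigest()
    hit = store.get(digest)
    if hit is None:
        store[digest] = (data, name)
        return digest
    old_data, old_name = hit
    if old_name == name and len(old_data) == len(data):
        return digest    # metadata matches: assume identical, skip the compare
    if old_data == data:
        return digest    # metadata differs: do the full comparison after all
    raise ValueError("hash collision detected")  # or re-hash, as sketched above
```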
The risk profile of such a system (e.g. a single git repository) is far different from that of an enterprise / cloud-scale environment that ingests large amounts of binary data, especially if that data is apparently random binary data (e.g. encrypted / compressed files) combined with something like sliding-window deduplication.
See also, e.g.: