In an ideal collision-resistant system, when a new file / object is ingested, the flow looks like this (a minimal code sketch follows the list):
- A hash of the incoming item is computed.
- If the incoming hash does not already exist in the store:
  - The item's data is saved and associated with the hash as its identifier.
- If the incoming hash matches an existing hash in the store:
  - The existing data is retrieved.
  - A bit-by-bit comparison of the existing data against the new data is performed.
  - If the two copies are identical, the new entry is linked to the existing hash.
  - If the copies are not identical, the new data is either:
    - rejected, or
    - appended or prefixed* with additional data (e.g. a timestamp or user ID) and re-hashed; the entire process is then repeated.
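Here's a minimal sketch of that flow in Python, just to make the steps concrete. The function name `ingest`, the in-memory dict standing in for the object store, SHA-256, and the timestamp prefix are all illustrative assumptions, not any particular system's implementation:

```python
import hashlib
import time

def ingest(store: dict[str, bytes], data: bytes, prefix: bytes = b"") -> str:
    """Collision-checked ingestion; `store` maps hex digests to raw payloads."""
    payload = prefix + data
    digest = hashlib.sha256(payload).hexdigest()
    existing = store.get(digest)
    if existing is None:
        # Hash not seen before: save the payload under its digest.
        store[digest] = payload
        return digest
    if existing == payload:
        # Byte-for-byte identical: link the new entry to the existing hash.
        return digest
    # Same digest, different bytes: a genuine collision. Prefix the data with
    # extra bytes (a timestamp here) and repeat the whole process; rejecting
    # the write outright is the other option.
    return ingest(store, data, prefix + str(time.time_ns()).encode())

store: dict[str, bytes] = {}
print(ingest(store, b"hello world"))  # first write: digest of the raw bytes
print(ingest(store, b"hello world"))  # identical bytes: same digest, data reused
```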
So no, it's not inevitable that information is lost in a content-addressable storage system.
* Ideally, the existing stored data would then be re-hashed in the same way, and the original hash entry tagged (e.g. linked to a zero-byte payload) to indicate that multiple stored objects originally resolved to that hash (similar in concept to a 'Disambiguation page' on Wikipedia). Whether that is necessary depends on how data needs to be retrieved from the system.
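If that re-keying step is wanted, it might look something like the sketch below (again hypothetical: `disambiguate`, the prefix argument, and the zero-byte marker convention are assumptions layered on the dict-based store above; a real system would also have to record which prefixed digests hang off the marker so lookups can still be resolved):

```python
import hashlib

def disambiguate(store: dict[str, bytes], digest: str, prefix: bytes) -> str:
    """Move the payload originally stored at `digest` to a prefixed hash and
    leave a zero-byte marker behind, flagging the old digest as a former
    collision point (a 'disambiguation' entry)."""
    original = store[digest]
    rekeyed = hashlib.sha256(prefix + original).hexdigest()
    store[rekeyed] = prefix + original
    store[digest] = b""  # zero-byte payload: multiple objects once resolved here
    return rekeyed
```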
While intentionally causing a collision may be astronomically impractical for a given algorithm, a random collision is possible as soon as the second storage transaction.
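The usual birthday-bound approximation puts numbers on that: for n uniformly random b-bit digests, the chance of at least one collision is roughly 1 - exp(-n(n-1) / 2^(b+1)), which is tiny but never zero. A quick way to get a feel for it, assuming an ideal 256-bit hash:

```python
import math

def collision_probability(num_objects: int, hash_bits: int = 256) -> float:
    """Birthday-bound estimate of the chance that at least two of
    `num_objects` uniformly random `hash_bits`-bit digests collide."""
    n = num_objects
    # -expm1(x) computes 1 - e**x without losing precision for tiny x.
    return -math.expm1(-n * (n - 1) / (2.0 * 2.0 ** hash_bits))

print(collision_probability(2))       # about 8.6e-78: possible, never zero
print(collision_probability(10**12))  # about 4.3e-54 even at a trillion objects
```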
Note: Some small / non-critical systems skip the binary comparison step, trading risk for bandwidth or processing time. (Usually, this is only done if certain metadata matches, such as filename or data length.)
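In code, that shortcut might look roughly like this (hypothetical `ingest_fast`, storing a (payload, name) tuple per digest; the fall-back to a full comparison when the metadata differs is one reasonable choice, not how any particular system actually behaves):

```python
import hashlib

def ingest_fast(store: dict[str, tuple[bytes, str]], data: bytes, name: str) -> str:
    """Shortcut variant: trust a digest match when the stored name and length
    also match, skipping the byte-for-byte comparison entirely."""
    digest = hashlib.sha256(data).hexdigest()
    hit = store.get(digest)
    if hit is None:
        store[digest] = (data, name)
        return digest
    old_data, old_name = hit
    if old_name == name and len(old_data) == len(data):
        return digest    # metadata matches: assume identical, skip the compare
    if old_data == data:
        return digest    # metadata differs: do the full comparison after all
    raise ValueError("hash collision detected")  # or re-hash, as sketched above
```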
The risk profile of such a system (e.g. a single git repository) is far different from that of an enterprise / cloud-scale environment that ingests large amounts of binary data, especially if that data is apparently random binary data (e.g. encrypted / compressed files) combined with something like sliding-window deduplication.
See also, e.g.: