-1

We're generating hashes to provide identifiers for documents being stored in RavenDB. We're doing this as there is a limit on the length of the DocumentID (127 characters - ESent limitation) if you want to use BulkInsert like so:

 _documentStore.BulkInsert(options: new BulkInsertOptions { CheckForUpdates = true }))

In order for the BulkInsert to work, the DocumentID needs to match the row being upserted; so we want an DocumentID that can be regenerated from the same source string consistently.

An MD5 hash will provide us a fixed length value with a low probability of collision, with the code used to generate the hash below:

public static string GetMD5Hash(string inputString)
{
    HashAlgorithm algorithm = MD5.Create();

    var hashBytes = algorithm.ComputeHash(Encoding.UTF8.GetBytes(inputString));

    return Encoding.UTF8.GetString(hashBytes);
}

However; RavenDB does not support "\" in DocumentID; so I want to replace it with "/". However my fear is that in doing so we are increasing the likelihood of a hashing conflict.

Code I want to change to:

public static string GetMD5Hash(string inputString)
{
    HashAlgorithm algorithm = MD5.Create();

    var hashBytes = algorithm.ComputeHash(Encoding.UTF8.GetBytes(inputString));

    return Encoding.UTF8.GetString(hashBytes).Replace('\\', '"');
}

Will this increase the likelihood of hash conflicts and remove our ability to depend on the DocumentID as "unique"?

Luke Merrett
  • 5,724
  • 8
  • 38
  • 70
  • 2
    Is the idea to be able to re-calculate the same DocumentID based on the `inputString`? That is, later on, if provided the same value for `inputString` you expect to return the same `DocumentID`? Or do you simply want a mechanism to produce a unique `DocumentID` _regardless_ of what `inputString` is? (that is, pass the same `inputString` in later on, it's ok to get a new `DocumentID`?) – Chris Sinclair Jun 11 '14 at 15:57
  • 3
    If you have 127 characters, you have enough to store a [GUID](http://en.wikipedia.org/wiki/Globally_unique_identifier), which sounds exactly like what you want. – Erik Philips Jun 11 '14 at 15:58
  • 4
    Use Base64 instead of UTF8 and you will solve your problem. – ForguesR Jun 11 '14 at 15:59
  • Recalculate the same DocumnentID Chris. So originally the DocumentID was based on a concatenation of the prior values, for example `NetworkName_LocationName_AreaName`. However if you want to perform batch upserts using `BulkInsert` you cannot have an ID over 127 characters. See [this link](http://issues.hibernatingrhinos.com/issue/RavenDB-1304) for context – Luke Merrett Jun 11 '14 at 16:00
  • @ErikPhilips I need an identifier that I can consistently generate from a source string. A GUID will give me the same functionality as using the default RavenDB generated identifier – Luke Merrett Jun 11 '14 at 16:03
  • @ForguesR I do believe you're right, feel free to put that in a full answer whilst I change the code and see if that works – Luke Merrett Jun 11 '14 at 16:04
  • I still don't see how the source string is unique itself. It is not possible to derive a unique value from a non-unique source. – Erik Philips Jun 11 '14 at 16:09
  • @ErikPhilips it's just a case of using a natural key rather than a generated surrogate key in our scenario; means we don't have to hit the database to find out which surrogate key matches our fields. – Luke Merrett Jun 11 '14 at 16:13
  • 1
    Then I would recommend reading [How to Create Deterministic Guids](http://stackoverflow.com/questions/2642141/how-to-create-deterministic-guids). That should reduce the chances of collisions over MD5 (typically by using SHA1 instead of MD5). – Erik Philips Jun 11 '14 at 16:18
  • @ErikPhilips interesting; I had no idea you could do that! Thank you for the link. – Luke Merrett Jun 11 '14 at 16:19

3 Answers3

2

X-Y problem - instead of converting byte array into version that is known to be correctly handled as string with Base64 (or similar) you using UTF8 as encoding.

Reading random byte array as UTF8 string will have non-printable and 0 characters as well random failures due to incorrect UTF8 sequences.

Use Base64 (or base32 if need case insensitive string). If some characters still not supported - replace with other unique ones. I.e. URL-friendly base64 uses -, _ and no padding to simplify encoding as query parameter.

To original question:

  • hash of any kind can't be considered "unique ID" for document due to possibility of collisions.
  • yes replacing one character with another that already could be used in the string will decrease number of possible combinations and increase possibility of collision. I can't estimate it properly - math or statistics specific question may be needed if you really need precise answer.
Community
  • 1
  • 1
Alexei Levenkov
  • 98,904
  • 14
  • 127
  • 179
2

You increase the probability of collision, but only slightly. All "/" in the output hash are like 'wildcards' which match either "/" or "\" in the raw hash. If you have zero of these in a hash, nothing changes. If you have one of these in a hash, there are now twice as many documents that can match that hash. If you have two in a hash, there are four times as many. Having many more is unlikely given the alphabet and the length of the MD5 hash.

The probability of a collision is still pretty small (unless you have a huge number of documents, etc).

However, you should do what was suggested in comments and use a Base64 or HEX string to store the MD5.

Bad things can happen in cryptography when you 'roll your own' and try and modify protocols which you don't have an inside-out understanding of. You should always always stick to doing standard things which have been tested theoretically and in practice and found to be reasonable. Bruce Schneier puts across this principle at length in Practical Cryptography and elsewhere.

jwg
  • 5,547
  • 3
  • 43
  • 57
1

Use Base64 instead of UTF8 and you will solve your problem (no more /).

Have a look at Convert.ToBase64String.

ForguesR
  • 3,558
  • 1
  • 17
  • 39