0

right now, we are looking for generating some unique and deterministic ID for some string value (file URL). Based on this link How to Create Deterministic Guids, looks like that we could create a GUID based on MD5 hash or Sha1 Hash (type 3 or type 5, see GUID wiki page). I did some search on internet as well, I think that it is pretty much same, basically generate a deterministic GUID based on hash.

It looks great when I first saw it, however I am still not comfortable to use it as key to identify something. I think that generally hash is used to check:

  1. whether 2 string matches without revealing the content of the original string
  2. whether some string/file content is changed

Here, even if there is some collision on hash value, it is not great, but it is ok and it would self correct itself once data is changed again and it won't overwrite other unrelated data. However if we use hash as a Primary key to identify some data, a collision would mean that we would override some unrelated data, there is no way to self correct once overwrite happens.

So it seems to me that we should use database to really generate the deterministic GUID here instead of relying on Hash:

  1. have a table at database with 2 columns: str_val, guid_val. str_val is the PK
  2. if we need to generate a guid for string1, we would try to find the record at the table
  3. If we could find a guid, we are done.
  4. If we could not find a guid, we do the insert logic. If insert is failed, most probably it is due to that other thread just inserts one, however it is probably ok since insert race should happen very rarely.

Just before I am going to post my question, I saw this stackoverflow post: How safe is it to rely on hashes for file identification?, it use the hash as the file identification where the accepted answer thinks that it is ok to use hash as the key. Again, I feel that I still need more convincing here for this.

If anyone could give more suggestions, it would be really appreciated.

Community
  • 1
  • 1
windfly2006
  • 1,703
  • 3
  • 25
  • 48
  • Please have a look at this question it may answer your question [http://stackoverflow.com/questions/9727090/unique-identifier-guid-as-primary-key-in-database-design][1] [1]: http://stackoverflow.com/questions/9727090/unique-identifier-guid-as-primary-key-in-database-design – Jake Rote Jan 27 '14 at 19:05

2 Answers2

1

Talking about GUIDs or UUIDs in this context is confusing.

The Key is either

  1. a function of the input (a hash, be it a "deterministic GUID" or otherwise) or;
  2. independent of the input, such as random GUID or auto-increment ID

All hashes for which the range (all possible outputs) is less than the domain (all possible inputs) will have collisions due to the Pigeonhole Principle. However, the idea is to use an appropriate hash for which it is improbable for a collision to occur, which is discussed in the linked questions.

Cryptographic hash functions also have additional goals, such as "it is infeasible to find two different messages with the same hash." (That is, even if someone was trying, it should still be impractical to generate a collision - MD5 fails here and is considered broken; and SHA-1 has been superseded by SHA-2.)

If the Key is independent of the data (e.g. is not a hash) then we might as well use an auto-increment ID - which is deterministic, albeit independent of the data, and guaranteed unique by the database - and call it a day.


So, using a Hash, the key is a form of a Natural Key that identifies data. This makes it possible to query the database with just the hash. And, if we trust that the client generated the hash from data on-hand then it can generally be assumed that the client has the data which results in the hash.

While using an "independent deterministic value", the Key is a Surrogate Key that identifies the tuple. In this case we need to query against the data to find the appropriate data if the "ID" is unknown. This of course requires that the client uses the original data in such queries.

Both of these approaches are valid in the appropriate context, and I use both in database design. (I generally prefer Candidate Keys to impose the multiplicity, but the result is the same.)


Community
  • 1
  • 1
user2864740
  • 60,010
  • 15
  • 145
  • 220
  • Do collisions occur when domain is less than range or when range is less than domain? And by "being less" you mean that the *number of elements* in the set, perhaps it will be useful to clarify that. – Miserable Variable Jan 27 '14 at 19:57
  • thanks @user2864740, I got your point. I guess that it is really a judgement call whether our solution could afford a collision although the chance is very small. At the post http://stackoverflow.com/questions/5525215/how-safe-is-it-to-rely-on-hashes-for-file-identification/5525283#5525283, there is another answer states that points as well. – windfly2006 Jan 27 '14 at 21:31
  • @windfly2006 The idea is that the chance of collision is *improbably* small. Like, there is *less chance of a collision than being hit by a meteor*. See the table at the bottom of [Hash Collision Probabilities](http://preshing.com/20110504/hash-collision-probabilities/). – user2864740 Jan 27 '14 at 21:39
  • @windfly2006 Even GUIDs have this same "chance" of duplication - so it comes down to: should the *hash* of the data be exposed or should an *arbitrary unique ID* be exposed? If the hash is exposed, people can try "probing attacks" against hashes they themselves have generated. – user2864740 Jan 27 '14 at 22:35
  • thanks @user2864740, I am thinking that GUID (type 1) should have a less chance for duplication though compared with GUID (type 3/5). However I do get your points. Thanks very much. – windfly2006 Jan 28 '14 at 02:55
0

You can use ticks to get a unique name from time. I have use both the tick and a random string, then concadenate both and obtain a unique string name for my files. Hope it helps

http://msdn.microsoft.com/en-us/library/system.datetime.ticks(v=vs.110).aspx

Juan
  • 1,352
  • 13
  • 20
  • we want to get the same GUID for same file URL at any given time since there should be only one record for same file URL. So I am not sure if above approach would work. – windfly2006 Jan 27 '14 at 20:46
  • I think when you create a guid the time is considered. I have my concern about if you can recreate a guid. – Juan Jan 28 '14 at 18:41
  • sorry, I guess I didn't make it clear. we would like to be able to create the same GUID for same file URL. – windfly2006 Jan 28 '14 at 21:45