Right now we are looking to generate a unique, deterministic ID for a string value (a file URL). Based on the article How to Create Deterministic Guids, it looks like we could create a GUID from an MD5 or SHA-1 hash (version 3 or version 5; see the GUID wiki page). I also searched around, and the approaches I found are pretty much the same: generate a deterministic GUID from a hash of the input.
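For reference, here is a minimal sketch of the hash-based approach using Python's standard uuid module. The helper name guid_for_url and the example URL are made up for illustration; the choice of the URL namespace is also an assumption, and any fixed namespace would work as long as it never changes.

```python
import uuid


def guid_for_url(url: str) -> uuid.UUID:
    """Return a name-based (version 5, SHA-1) GUID for the given URL.

    uuid.uuid3 would give the MD5-based (version 3) variant instead.
    """
    # The namespace must stay fixed, otherwise the same URL maps to a new GUID.
    return uuid.uuid5(uuid.NAMESPACE_URL, url)


example_url = "https://example.com/files/report.pdf"  # hypothetical URL
first = guid_for_url(example_url)
second = guid_for_url(example_url)

print(first == second)  # True: same input always yields the same GUID
print(first.version)    # 5
```

No storage is needed here: any process can recompute the same GUID from the same string.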
It looked great when I first saw it, but I am still not comfortable using it as a key to identify something. As I understand it, a hash is generally used to check:
- whether two strings match without revealing the content of the original string
- whether the content of a string or file has changed
In those cases a hash collision is not great, but it is tolerable: it corrects itself once the data changes again, and it does not overwrite unrelated data. However, if we use a hash as a primary key to identify data, a collision means we would overwrite unrelated data, and there is no way to self-correct once the overwrite has happened.
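To put a rough number on that risk, the standard birthday approximation can be sketched as below. This assumes the hash output behaves like a uniformly random value and that nobody is deliberately crafting colliding inputs (practical collision attacks exist for both MD5 and SHA-1). The function name and the 122-bit figure (a 128-bit GUID minus the 6 fixed version and variant bits) are my own framing, not something from the linked article.

```python
import math


def collision_probability(n_items: int, bits: int = 122) -> float:
    """Birthday approximation: chance that any two of n_items values collide.

    Assumes values are uniformly random over 2**bits possibilities.
    """
    space = 2 ** bits
    pairs = n_items * (n_items - 1) / 2
    # p ~= 1 - exp(-pairs / space); expm1 keeps precision for tiny results.
    return -math.expm1(-pairs / space)


# One billion distinct URLs: the result is on the order of 1e-19.
print(collision_probability(10 ** 9))
```

This only covers accidental collisions; it says nothing about inputs chosen by an attacker.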
So it seems to me that we should use a database to generate the GUID here instead of relying on a hash:
- have a database table with two columns, str_val and guid_val, where str_val is the PK
- when we need a GUID for string1, look up its record in the table
- if we find a GUID, we are done
- if we do not find one, insert a new record. If the insert fails, most probably another thread has just inserted the same key; that is probably OK, since insert races should be very rare.
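The lookup-then-insert steps above could be sketched like this, using SQLite only because it ships with Python. The table name guid_map, the helper names, and the choice to re-read the row after a failed insert are assumptions of mine; str_val and guid_val are the columns described above. Note that the GUID here is random and only stable because the table remembers it.

```python
import sqlite3
import uuid


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the mapping table, creating it if needed (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS guid_map ("
        " str_val TEXT PRIMARY KEY,"
        " guid_val TEXT NOT NULL UNIQUE)"
    )
    conn.commit()
    return conn


def _lookup(conn: sqlite3.Connection, str_val: str):
    row = conn.execute(
        "SELECT guid_val FROM guid_map WHERE str_val = ?", (str_val,)
    ).fetchone()
    return row[0] if row is not None else None


def get_or_create_guid(conn: sqlite3.Connection, str_val: str) -> str:
    """Return the stored GUID for str_val, inserting a random one if absent."""
    existing = _lookup(conn, str_val)
    if existing is not None:
        return existing
    candidate = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO guid_map (str_val, guid_val) VALUES (?, ?)",
                (str_val, candidate),
            )
        return candidate
    except sqlite3.IntegrityError:
        # Most likely another writer inserted the same str_val first.
        existing = _lookup(conn, str_val)
        if existing is None:
            raise  # some other constraint failed; do not hide it
        return existing


conn = open_store()
a = get_or_create_guid(conn, "https://example.com/files/report.pdf")
b = get_or_create_guid(conn, "https://example.com/files/report.pdf")
print(a == b)  # True
```

The trade-off compared with a hash-based GUID is that every caller must go through the same table, so the ID cannot be recomputed offline.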
Just before posting my question, I saw this Stack Overflow post: How safe is it to rely on hashes for file identification? It uses a hash as the file identifier, and the accepted answer considers it OK to use the hash as the key. Still, I feel I need more convincing on this.
Any further suggestions would be really appreciated.