25

I'm building a database that will store information on a range of objects (such as scientific papers, specimens, DNA sequences, etc.) that all have a presence online and can be identified by a URL, or an identifier such as a DOI. Using these GUIDs as the primary key for the object seems a reasonable idea, and I've followed delicious and Connotea in using the md5 hash of the GUID. You'll see the md5 hash in your browser status bar if you mouse over the edit or delete buttons in a delicious or Connotea book mark. For example, the bookmark for http://stackoverflow/ is

http://delicious.com/url/e4a42d992025b928a586b8bdc36ad38d

where e4a42d992025b928a586b8bdc36ad38d ais the md5 hash of http://stackoverflow/.

Does anybody have views on the pros and cons of this approach?

For me an advantage of this approach (as opposed to using an auto incrementing primary key generated by the database itself) is that I have to do a lot of links between objects, and by using md5 hashes I can store these links externally in a file (say, as the result of data mining/scraping), then import them in bulk into the database. In the same way, if the database has to be rebuilt from scratch, the URLs to the objects won't change because they use the md5 hash.

I'd welcome any thoughts on whether this sounds sensible, or whether there other (better?) ways of doing this.

rdmpage
  • 941
  • 2
  • 8
  • 9

7 Answers7

12

It's perfectly fine.

Accidental collision of MD5 is impossible in all practical scenarios (to get a 50% chance of collision you'd have to hash 6 billion URLs per second, every second, for 100 years).

It's such an improbable chance that you're trillion times more likely to get your data messed up due to an undetected hardware failure than due to an actual collision.

Even though there is a known collision attack against MD5, intentional malicious collisions are currently impossible against hashed URLs.

  • The type of collision you'd need to intentionally collide with a hash of another URL is called a pre-image attack. There are no known pre-image attacks against MD5. As of 2017 there's no research that comes even close to feasibility, so even a determined well-funded attacker can't compute a URL that would hash to a hash of any existing URL in your database.

  • The only known collision attack against MD5 is not useful for attacking URL-like keys. It works by generating a pair of binary blobs that collide only with each other. The blobs will be relatively long, contain NUL and other unprintable bytes, so they're extremely unlikely to resemble anything like a URL.

Kornel
  • 97,764
  • 37
  • 219
  • 309
  • 4
    Exactly. You still need to be aware that deliberate MD5 hash collisions are *easy* to do. If this is going to be a problem you would be better off using SHA-1 or some other newer hash function that isn't as broken as MD5. – paranoidgeek Jan 01 '09 at 03:20
9

After browsing stackoverfow a little more I found an earlier question Advantages and disadvantages of GUID / UUID database keys which covers much of this ground.

Community
  • 1
  • 1
rdmpage
  • 941
  • 2
  • 8
  • 9
1

Multiple strings can produce the same md5 hash. Primary keys must be unique. So using the hash as the primary key is not good. Better is to use the GUID directly.

Is a GUID suitable for use in a URL. Sure. Here's a GUID (actually, a UUID) I jsut created using Java: 1ccb9467-e326-4fed-b9a7-7edcba52be84

The url could be:

http://example.com/view?id=1ccb9467-e326-4fed-b9a7-7edcba52be84

It's longish but perfectly usable and achieves what you describe.

Steve McLeod
  • 51,737
  • 47
  • 128
  • 184
  • But if I want to use the primary key in a URL I'll need to reformat it, as opposed to just using the md5 hash. I'm looking to keep things simple. – rdmpage Oct 21 '08 at 09:29
  • Why would you need to reformat it? The Internet Archive uses URLs verbatim, as substrings of the URLs of archived pages. – Hugh Allen Oct 21 '08 at 10:03
  • rdmpage -- Md5 is not unique. Hence, it's not fit for a primary key. – MichaelGG Oct 21 '08 at 10:03
  • 2
    But if the probability of a collision is low, then this isn't a big issue. Plus, unlike a UUID I can guarantee that I can regenerate the same md5 hash from the same URI, which is useful if the GUID is generated at different times, for example. UUIDs seem to address a different problem. – rdmpage Oct 21 '08 at 12:20
  • Hugh -- But some URIs are very messy, may contain parts that are url encoded and parts that aren't, etc, making life difficult. A hash seems a lot easier to manage. – rdmpage Oct 21 '08 at 12:24
0

MD5 is considered deprecated - at least for cryptographic purposes, but I would suggest only using md5 for backwards compatibility with existing stuff. You should have a good reason to go with md5 when we do have other hash algos out there that aren't (at least yet) broken.

Problems I see with the approach:

  • Duplicate objects, because the url identifier is different (As arend mentioned)
  • URLs changing

The latter being the one that might be important - this could be done as simply as a remove and an add. That is, if these ids are never visible/storable outside the database. (Like as a component of a URL.)

I guess these won't be a problem for DOIs.


How would it work with a non-autonumber integer id setup, but where the offline inserter agent creates the numbers? (Can use a dedicated range of numbers, maybe?) Might have a problem with duplication should two users independently add the same url?

MaHuJa
  • 3,148
  • 1
  • 18
  • 6
  • 1
    He is asking about indexing, not cryptographic. It's very different if the collisions are deliberate or randomly. In the end it depends on the question if he can live with a random collision. – Lothar Sep 08 '14 at 16:32
0

Maybe this document is something you want to read:

http://www.hpl.hp.com/techreports/2002/HPL-2002-216.pdf

MysticSlayer
  • 380
  • 3
  • 7
  • Interesting read, but slightly different problem. Furthermore, in my case the content may well be edited, but I'd want to keep the URI unchanged. The approach in the HP tech report would require me to change the URI with each edit. – rdmpage Oct 21 '08 at 08:52
0

Often lots of different urls point to the same page. http://example.com/ example.com http://www.example.com/ http://example.com/index.html http://example.com/. https://example.com/ etc.

This might or might not be a problem for you.

  • I do indeed have multiple identifiers for the same object (articles often have more than one presence on the web). That's not a huge problem, I'm not relying on people generating the hash to find out whether the URI exists in my database. – rdmpage Oct 21 '08 at 08:53
-1

md5 hash is almost unique, but is not totally unique unique so don't use it as primary key. It is depreciated for cryptographic use. There is less chance of key collision, but if you have pretty big database with billions of rows, there is still some chance of collision. If you insist using hash as primary key use other better hash. You cannot use non unique values for Primary Key. If you have pretty big table, don't use it. If you have small table, you might use it, but not recommended.

Prabhu
  • 5,296
  • 4
  • 37
  • 45