5

Possible Duplicate:
How do I create a HashCode in .net (c#) for a string that is safe to store in a database?

I'm planning to store hundreds of thousands of URLs in my database. Each row in my UrlInfo table will be immutable with the URL itself being the logical primary key. Since URL's can be fairly lengthy, I've decided to hash the URL as a quick way to find possible matches when adding new rows. The hash isn't my true key, just a way to find possible matches quickly. In addition I'm using a RegEx pattern per domain that distills the essence of the URL into something that can be compared against other URL's. I'm storing the results of the RegEx also as a hash and I'm not concerned if it yields possible duplicates.

Everything has been going great until I learned that C#'s string.GetHashCode() method, which I've been using to hash things, is not guaranteed to be unique across implementations of .Net. I noticed this when I tried moving my hash function from ASP.Net to SQLServer CLR code. The web app uses .Net 4.0 and SQLServer 2008 R2, I've learned, uses .Net 3.5. They yielded separate hash results for the same string so now I need to get away from using string.GetHashCode() because I don't want to have to worry about this changing when I upgrade my app to to future versions of .Net.

So, questions:

  1. Does my architecture smell since I'm storing a hash in my db? Any better ways? Obviously Microsoft doesn't want me to store hash results!

  2. Can someone recommend a good C# replacement algorithm for hashing strings? I saw Jon's here but not exactly sure how to modify to work for strings (loop through each char using ascii codes?).

  3. Are there any string compression algorithms that would be better than using a hash algorithm?

Thanks

AWESOME RESPONSES BY MANY. THANK YOU VERY MUCH!!!

Community
  • 1
  • 1
sisdog
  • 2,649
  • 2
  • 29
  • 49
  • I know this doesn't answer your question, but would be very cautious about using a URL as a key for anything long-term. My experience is that they change for any number of reasons. – n8wrl Mar 18 '11 at 16:33
  • They can, though a URI essentially IS a key to a distributed global system. As long as your use of it can handle things changing, it can be a good approach in many cases. Sadly, it works a lot better in databases that have their own hash indexing. – Jon Hanna Mar 18 '11 at 16:38
  • Hey @n8wrl, for my app it truly is the URI itself that I'm interested in, not necessarily what it stands for. I'm using my secondary RegEx/hash strategy to mitigate the problem with possible variations of URL's. So for instance, I can match these two strings because my RegEx for domain abc is only using param a: `http://www.abc.com?/a=5` <-matches-> `www.abc.com?a=b&pagecount=2` – sisdog Mar 18 '11 at 16:49

5 Answers5

3

You could always use an MD5 hash instead, which is relatively fast:

public string GetUrlHash(string url) {

    byte[] hash = MD5.Create().ComputeHash(Encoding.UTF8.GetBytes(url));

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < hash.Length; i++) {
        sb.Append(hash[i].ToString("X2"));
    }

    return sb.ToString();
}

Call it like so:

Console.WriteLine(this.GetUrlHash("http://stackoverflow.com/questions/5355003/storing-c-gethashcode-in-db-is-unreliable"));

And get:

> 777BED7F83C66DAC111977067B4B4385

This should be fairly reliable from an uniqueness standpoint. MD5 is insecure nowadays for password applications but you don't have that problem here.

The only problem is using a string like this as a primary key on a table might be problematic, performance-wise.

The other thing you could do is use the URL shortener approach: use your database's sequence generation feature, and convert the value (make sure you use the equivalent of LONG or BIGINT!) to something like Base36, which gives you a nice, concise string.

kprobst
  • 16,165
  • 5
  • 32
  • 53
  • thx so much for the answer. After a lot of thought today, I ended up with your suggested MD5. I considered (1) a flat VARBINARY field and (2) a BASE36 compression but MD5 is very well know and SQLServer has native support for it with their HASHBYTES function. I'm happy to know that my hash values will be consistent for the life of my immutable objects. – sisdog Mar 18 '11 at 23:34
  • Instead of Encoding.ASCII, that is "incomplete" and "not-reversible" given non-ASCII characters, you should use Encoding.UTF8 – xanatos Mar 18 '11 at 23:48
  • @xanatos true, especially if the URL contains extended characters. I updated the snippet. – kprobst Mar 18 '11 at 23:49
1

A similar question has been asked here:

How do I create a HashCode in .net (c#) for a string that is safe to store in a database?

It may prove to solve your issue.

Community
  • 1
  • 1
Adam Houldsworth
  • 63,413
  • 11
  • 150
  • 187
  • Thanks @Adam, somehow I missed that one in my stackoverflow search. Yea, it confirms what I was thinking. I guess my architecture is still sound but I need to find a good replacement for string.GetHashCode(). – sisdog Mar 18 '11 at 16:42
  • In that question Jon Skeet provides a good algorithm that hashes strings. – as-cii Mar 18 '11 at 16:45
1

As a note, SQL Server from 2008 supports (has) the function HASHBYTES, that given some data (a string for example) can generate a MD2, MD4, MD5, SHA or SHA1 hash.

xanatos
  • 109,618
  • 12
  • 197
  • 280
0

I'd say you probably don't need to store the hash.

Just make sure that you correctly index the URL column in your table (a unique index) and the searches should be fast.

iggymoran
  • 4,059
  • 2
  • 21
  • 26
  • I don't know if I can, or want, to do that iggy. These URL's could get long. Some URL's have params hanging off of them that represent text, for instance, from a posted form. Sqlserver won't allow a varchar greater than 900 for an index, I wasn't going to limit myself to 900. – sisdog Mar 18 '11 at 17:32
0

Have you considered zipping down your string and storing a VARBINARY? It could be much smaller, you could build an index directly on it.

A-K
  • 16,804
  • 8
  • 54
  • 74
  • I thought able this @Alex but it still will grow large as my string grows large. I chose MD5 because it will have a low collision rate and will keep my data smaller. – sisdog Mar 18 '11 at 23:36