15

I'm trying to choose a hash algorithm for comparing about max 20 different text data.

Which hash is better for these requirements?

  • Less CPU Consumption
  • Small footprint (<=32 bytes)
  • Collision is not a big deal
  • Can be generated from .NET Framework 2 (shouldn't be a 3rd party library)

I'm using hash for less memory footprint and comparison performance

dr. evil
  • 26,944
  • 33
  • 131
  • 201
  • Can you elaborate on what you mean by "Can be generated from .NET Framework 2", do you mean something that already exists in the BCL or is something that can easily be implemented yourself acceptable? – Robert Gamble Dec 21 '08 at 19:34
  • Can you clarify "Can be generated from .NET Framework 2 (shouldn't be a 3rd party library)" Do you mean "the hash itself must be generated from an algorithm that exists in the framework" or "the Algorithm must be generated from types in the framework"? – Greg Dean Dec 21 '08 at 19:35
  • I mean a native function or library instead of an external huge project or DLL dependency. – dr. evil Dec 21 '08 at 22:14
  • 20
    Have you considered using one or more of the following general purpose hash functions: http://www.partow.net/programming/hashfunctions/index.html they are extremely fast and efficient. –  Nov 05 '09 at 08:38

8 Answers8

9

If collision is not a big deal you can take the first letter of each document. Or you can use the length of the text or the string with the text.

tuinstoel
  • 7,248
  • 27
  • 27
7

Paul Hsieh has a decent, simple, fast, 32-bit SuperFastHash that performs better than most existing hash functions, is easier to understand/implement, and sounds like it meets your criteria.

Robert Gamble
  • 106,424
  • 25
  • 145
  • 137
4

The FNV hash is a well-known fast hashing algorithm. It is not cryptographically secure, but it sounds like you don't need a secure hash.

Greg Hewgill
  • 951,095
  • 183
  • 1,149
  • 1,285
2

A very quick check would be to take the length of a text and XOR it with the first 4 bytes of it and use that as a hash. If this is good enough it is extremely fast because independent of the number of bytes of the file.

martinus
  • 17,736
  • 15
  • 72
  • 92
0

If you are constrained to algorithms that exist in the framework

Is MD5 small enough (16 bytes)?

Less CPU Consumption and Small footprint are usually mutually exclusive.

http://en.wikipedia.org/wiki/Time-space_tradeoff

Greg Dean
  • 29,221
  • 14
  • 67
  • 78
  • It's small but CPU usage is not that efficient, I'd rather use MD4 (as a non-brainer) or CRC32 alike stuff. – dr. evil Dec 21 '08 at 19:14
  • That's right. I was confused because it's usually displayed as a 32 digit hex number. – Bill the Lizard Dec 21 '08 at 19:18
  • AFAIK neither MD4 or CRC32 are implemented in the BCL. If don't want to roll your own or use a 3rd party, just pay 15% tax for the convenience of MD5 – Greg Dean Dec 21 '08 at 19:20
0

Check out the serie Peter Karkowski published on his blog.

arul
  • 13,998
  • 1
  • 57
  • 77
0

How long does the hash need to hold for? GetHashCode() is pretty accessible, gives a small response (4 bytes), which should be fine (re minimizing collisions) over 20 strings.

However, GetHashCode() should not be persisted to database - it is fine for in-memory comparisons, though. Just be aware that the algorithm may change between frameworks (and did between 1.1 and 2.0).

The other advantage of this is that it is trivial to use - just use a Dictionary<string,Something>, which will deal with all the hashing etc for you.

Marc Gravell
  • 1,026,079
  • 266
  • 2,566
  • 2,900
0

I had the same request for myselve and i implemented xxHashSharp . Just make sure you take the appropriate library ( x32 vs x64). It's also available outside of c# here

NicoJuicy
  • 3,435
  • 4
  • 40
  • 66