I'm writing an application that uses hashing to speed up file comparisons. Basically, I pre-hash a set of files, and then the app runs and matches files in a folder against those previously hashed files. My criteria for a hash function are as follows:

  • It should be fast enough that disk IO is the limiting factor. I'm currently using SHA-256, which works just fine but is way too heavy and makes my application CPU bound.
  • Cryptography/security doesn't matter in this case. The user is inputting both files, so if they craft a hash collision intentionally, that's on them.
  • Hash collisions should be avoided at almost all costs. I can compare files based on size and hash, but if both of those match, the files are assumed to be equal. I know it's impossible to guarantee this with any hash, since a hash compresses the data, but something with the same sort of uniqueness guarantees as SHA-256 would be nice.
  • File sizes range from 10 bytes to 2 GB.
  • A streaming algorithm would be nice, as I try to keep the memory usage of the application low. In other words, I don't want to load the entire file into memory to hash it.
  • Hash size doesn't matter. If I got all of the above with 1024-bit hashes, I'd be completely okay with that.

So what's a good algorithm to use here? I'm using C#, but I'm sure most algorithms are available on any platform. As I said, I'm currently using SHA-256, but I'm sure there's something better.

Timothy Baldridge

1 Answer

Yann Collet's xxHash may be a good choice (Home page, GitHub)

xxHash is an extremely fast non-cryptographic hash algorithm, working at speeds close to RAM limits. It is proposed in two flavors, 32 and 64 bits.

At least four C# implementations are available (see the home page).

I had excellent results with it in the past.
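To illustrate the streaming requirement from the question, here is a minimal sketch that hashes a file in fixed-size chunks so the whole file never has to be in memory. It assumes the `System.IO.Hashing` NuGet package is referenced (one possible implementation, not necessarily one of those listed on the home page); the `FileHasher` class name and the 80 KB buffer size are hypothetical choices made for this example.

```csharp
using System;
using System.IO;
using System.IO.Hashing; // assumption: the System.IO.Hashing NuGet package is referenced

public static class FileHasher
{
    // Hashes a stream incrementally; only one buffer's worth of data is held at a time.
    public static byte[] HashStream(Stream input)
    {
        var hasher = new XxHash64();
        var buffer = new byte[81920]; // hypothetical chunk size, tune as needed
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Feed only the bytes actually read in this iteration.
            hasher.Append(buffer.AsSpan(0, read));
        }
        return hasher.GetCurrentHash(); // 8 bytes for XxHash64
    }

    // Convenience wrapper that opens the file and hashes its contents.
    public static byte[] HashFile(string path)
    {
        using var stream = File.OpenRead(path);
        return HashStream(stream);
    }
}
```

A call such as `Convert.ToHexString(FileHasher.HashFile(path))` would give a printable digest to store next to the file size.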

The hash size is 32 or 64 bits, but XXH3 is in the making:

XXH3 features a wide internal state of 512 bits, which makes it suitable to generate a hash of up to 256 bit. For the time being, only 64-bit and 128-bit variants are exposed, but a similar recipe can be used for a 256-bit variant if there is any need for it one day. All variant feature same speed, since only the finalization stage is different.

In general, the longer the hash, the slower its calculation. A 64-bit hash is good enough for most practical purposes.
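As a rough way to judge what "good enough" means, the birthday bound gives the chance of at least one accidental collision among $n$ distinct files. This is a sketch that assumes the hash behaves like an ideal, uniformly distributed $b$-bit function, which a non-cryptographic hash only approximates:

$$p \approx 1 - \exp\!\left(-\frac{n(n-1)}{2^{b+1}}\right) \approx \frac{n^2}{2^{b+1}} \quad \text{when } n^2 \ll 2^{b+1}$$

For example, with $n = 10^6$ files:

$$b = 32:\ \frac{n^2}{2^{33}} \approx 116 \ \Rightarrow\ p \approx 1, \qquad b = 64:\ p \approx \frac{10^{12}}{2^{65}} \approx 2.7 \times 10^{-8}, \qquad b = 128:\ p \approx \frac{10^{12}}{2^{129}} \approx 1.5 \times 10^{-27}$$

Since files are also compared by size, $n$ is effectively the number of files sharing the same size, which lowers the probability further.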

You can generate longer hashes by combining two hash functions (e.g. 128-bit XXH3 and 128-bit MurmurHash3).
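The combination idea can be sketched as follows: feed each chunk to two different hashers in a single pass over the data and concatenate the two digests. This example assumes the `System.IO.Hashing` NuGet package (version 8.0 or later, for `XxHash3`) and substitutes 64-bit `XxHash3` plus `XxHash64` for the XXH3/MurmurHash3 pair mentioned above, simply because both ship in that package. The `CombinedHasher` name is hypothetical, and the two functions are not guaranteed to be statistically independent, so the result may be somewhat weaker than an ideal 128-bit hash.

```csharp
using System;
using System.IO;
using System.IO.Hashing; // assumption: System.IO.Hashing 8.0+ is referenced

public static class CombinedHasher
{
    // Produces a 16-byte digest: 8 bytes of XxHash3 followed by 8 bytes of XxHash64.
    public static byte[] Hash128(Stream input)
    {
        var first = new XxHash3();
        var second = new XxHash64();
        var buffer = new byte[81920]; // hypothetical chunk size
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Both hashers consume the same chunk, so the file is read only once.
            first.Append(buffer.AsSpan(0, read));
            second.Append(buffer.AsSpan(0, read));
        }

        byte[] a = first.GetCurrentHash();
        byte[] b = second.GetCurrentHash();
        var result = new byte[a.Length + b.Length];
        a.CopyTo(result, 0);
        b.CopyTo(result, a.Length);
        return result;
    }
}
```

Reading the file once matters here because disk IO is meant to be the bottleneck; running the two hashes in separate passes would double the IO.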

Lior Kogan