
I am looking for a fast hash with low collisions implemented in JavaScript. It doesn't need to be a crypto hash. I am basically using it as a way to see if a given file has already been uploaded (or partially uploaded) to a user's account to save them some upload time on large (video) files.

I am using the new HTML5 File API to read in slices of the file. I then hand this off to SparkMD5 to give me a hash of the file. I like the fact that SparkMD5 allows me to do an incremental hash so I don't have to read the entire thing in memory.
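
For reference, the incremental read-and-hash loop described above looks roughly like the sketch below; the 2 MB chunk size and the use of the SparkMD5.ArrayBuffer helper are my assumptions, not details from the question.

```javascript
// Sketch: hash a File incrementally in 2 MB slices with SparkMD5.
// Assumes the spark-md5 library is loaded; the chunk size is arbitrary.
function hashFile(file, done) {
  var chunkSize = 2 * 1024 * 1024;
  var spark = new SparkMD5.ArrayBuffer();
  var offset = 0;
  var reader = new FileReader();

  reader.onload = function (e) {
    spark.append(e.target.result);   // feed this slice into the running hash
    offset += chunkSize;
    if (offset < file.size) {
      readNext();                    // more slices to go
    } else {
      done(spark.end());             // hex digest of the whole file
    }
  };

  function readNext() {
    reader.readAsArrayBuffer(file.slice(offset, offset + chunkSize));
  }

  readNext();
}
```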

Overall, SparkMD5 works for my needs, but for large files it can take a while to produce the hash (about 30 seconds for a 300MB file). I would ideally like to reduce this. I am not that knowledgeable about hash functions, so I am not looking to port something myself; ideally I'm looking for an already-implemented library.

Eric Anderson
  • What would be an acceptable duration? You could look into CRC32, which is meant to be faster than MD5, but the difference may not be noticeable and you'll probably get a higher collision rate. – Graham Apr 25 '13 at 22:28
  • Yeah, I looked at CRC32 but read somewhere that the collision rate is 0.4%. I am not knowledgeable enough to know if this is true, but others seemed to indicate it had a high collision rate as well. – Eric Anderson Apr 25 '13 at 22:30
  • To answer your question, I would ideally like it to take just a few seconds, even for a 1GB file. I don't know if that is realistic. – Eric Anderson Apr 25 '13 at 22:31
  • OK. How many videos might each user have? Why not checksum the first 1MB of each file that gets uploaded and store that in your DB. When a user selects a video to upload, calculate the checksum for the first 1MB and compare that against the user's DB entries. If there's a match then you'll need to checksum the rest, but in most cases you won't find any matches and the user can proceed to upload. – Graham Apr 25 '13 at 22:33
  • Well, I am partly thinking about a resume function. It won't always be a video and I want to ensure if they take a file and make minor edits that I don't resume in a way that misses their edits (I basically want it treated as a new file if they make any change). – Eric Anderson Apr 25 '13 at 22:35
  • If files aren't likely to have identical first 1MB then you could hash only the first 1MB, for example. – Patashu Apr 25 '13 at 22:50
  • The problem is: * User starts uploading file A and stops partway through * User changes file A (maybe after the first 1MB) * User resumes uploading. My code misses the change because it doesn't realize the file has changed. – Eric Anderson Apr 25 '13 at 23:02
  • I'm thinking the best option might be based on your suggestion. If I do a MD5 hash on the first chunk + the file size then this will have a low probability of collisions and detect if the file changes. The only issue is if the file changes but remains the same size. The chance and consequence of that happening is low enough to ignore it for my purposes. Thanks for the brainstorm. – Eric Anderson Apr 26 '13 at 20:07
  • [cyrb53](https://stackoverflow.com/questions/7616461/generate-a-hash-from-string-in-javascript/52171480#52171480) is designed to do exactly what you ask; it uses 53 bits (the maximum safe integer precision of a JS number), is optimized for speed, and obviously has fewer collisions than a typical 32-bit hash. – bryc Nov 16 '20 at 19:52
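
The first-chunk-plus-file-size fingerprint discussed in the comments above could look roughly like this; the function name and the 1 MB chunk size are illustrative, and as noted it still misses same-size edits past the first chunk.

```javascript
// Sketch: fingerprint = MD5 of the first 1 MB + the file size.
// Assumes SparkMD5 is available; names and chunk size are illustrative.
function fingerprintFile(file, done) {
  var reader = new FileReader();
  reader.onload = function (e) {
    var spark = new SparkMD5.ArrayBuffer();
    spark.append(e.target.result);
    // Appending the size means a file that grows or shrinks gets a new
    // fingerprint, even if its first 1 MB is unchanged.
    done(spark.end() + '-' + file.size);
  };
  reader.readAsArrayBuffer(file.slice(0, 1024 * 1024));
}
```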

1 Answer


Update (August 2021): My benchmark predates WebAssembly and is therefore out-of-date. There likely exist hash functions compiled into WebAssembly that beat the pure JS implementations. If somebody wants to update this benchmark, a pull request to my benchmark repo would be most welcome!


I benchmarked various hashing algorithms, and here are the fastest options I found:

  • If you only need 32 bit digests, use iMurmurHash. Note that this will give you collisions after about 2**14 (16,000) hashes.

  • Use SparkMD5 if you need more than 32 bits. I didn't find a fast 64 or 128 bit Murmur implementation, but SparkMD5 was surprisingly fast (75 MB/sec).

    • If you need incremental updates, consider joining strings into larger chunks before feeding them to SparkMD5, as SparkMD5's incremental hashing seems to suffer from some moderate overhead.

These recommendations are for pure JavaScript. I benchmarked them with strings, but SparkMD5 takes ArrayBuffers as well.
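
As a rough sketch of both recommendations (assuming the imurmurhash and spark-md5 npm packages; the same globals are exposed in the browser):

```javascript
// Sketch of the two pure-JS options; treat as illustrative, not canonical usage.
var MurmurHash3 = require('imurmurhash');
var SparkMD5 = require('spark-md5');

// 32-bit incremental MurmurHash3: chain .hash() calls, then read .result().
var murmur = MurmurHash3('first chunk');
murmur.hash('second chunk');
console.log(murmur.result());   // 32-bit unsigned integer

// 128-bit incremental MD5 with SparkMD5 (string input shown;
// SparkMD5.ArrayBuffer handles binary data).
var spark = new SparkMD5();
spark.append('first chunk');
spark.append('second chunk');
console.log(spark.end());       // hex digest
```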


If you want fast hashing modules on Node, your best options are slightly different:

  • If you are hashing Buffers: Use the built-in crypto module with the MD5 algorithm.

    • The exception to this is: if you don't need incremental hashing, need more than 500 MB/sec throughput, and are OK with a native npm dependency, use murmurhash-native for some extra performance. I didn't test digest sizes of less than 128 bits, as even with 128 bits the hashing is so fast that it's not likely to be a bottleneck.

      (Note that murmurhash-native technically supports incremental hashing, but it's slower than Node's built-in MD5 algorithm in this mode.)

  • If you are hashing a single string non-incrementally, convert it to a Buffer and see the preceding bullet point.

  • If you are hashing strings incrementally:

    • If you only need 32 bits, use iMurmurHash. Note that this will give you collisions after about 2**14 (16,000) hashes.

    • If you need more than 32 bits: Use the built-in crypto module with the MD5 algorithm.

      • I also recommend that you experiment with joining strings together into larger chunks, since strings are implicitly converted to Buffers when you pass them to the crypto module, and each Buffer creation is quite slow. Performance will generally be bottlenecked by Buffer creation and string joining, as the native hashing function is very fast in comparison.
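
A minimal sketch of the Node route (crypto.createHash is the built-in API; batching small strings into one Buffer per update is just one way to act on the advice above):

```javascript
// Sketch: incremental MD5 with Node's built-in crypto module.
const crypto = require('crypto');

const hash = crypto.createHash('md5');

// Batch small strings into one Buffer per update() call to avoid
// paying the Buffer-creation cost once per tiny string.
const smallStrings = ['many', 'small', 'pieces', 'of', 'text'];
hash.update(Buffer.from(smallStrings.join('')));

// Buffers can be passed straight through to the native hashing code.
hash.update(Buffer.alloc(1024 * 1024, 0xab));

console.log(hash.digest('hex'));   // 128-bit digest as hex
```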
Jo Liss
  • Good to know about iMurmurHash. Looks like it's about twice as fast as SparkMD5. You mentioned you were getting 75 MB/sec, while my original post got 300MB in 30 seconds. Not sure if I was slower due to smaller chunks (the overhead you mentioned) or slower computers (my question was from 4 years ago). – Eric Anderson Feb 06 '17 at 19:07
  • The chunks in my benchmark are very small -- just a few bytes -- so I'd guess slower computers and slower JavaScript engines. – Jo Liss Feb 07 '17 at 00:07
  • Note that if the goal is to check data integrity across copies/transmissions, that's _literally_ what CRC was invented for in the 1960's, and https://www.npmjs.com/package/crc is almost certainly much faster for the same utility than any of the listed options. – Mike 'Pomax' Kamermans Aug 15 '21 at 16:57