I frequently have to look up hashes in a large (up to 1 GB) CSV database of the format
sha256_hash, md5_hash, sha1_hash, field1, field2, field3 etc
in C. This needs to be very fast, and memory usage is a non-issue (at least 32 GB of RAM). I found this, which is very close to what I had in mind: load the data into RAM, sort the database by hash once, index it by the first n bytes of the hash, and then search the resulting smaller sublists. But the thread above doesn't seem to address a question I have in mind. Since I'm not a cryptography guy, I was wondering about the distribution of hashes and whether or not it could be used to make searching the sublists even faster. Any suggestions about this or about my general approach?
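
To make the idea concrete, this is roughly the lookup I have in mind. It is only a sketch under some assumptions: the hex strings from the CSV have already been parsed into raw 32-byte SHA-256 digests, the index uses the first 2 bytes of the hash (65,536 buckets), and the names `record_t`, `hash_index_t`, `index_build` and `index_find` are made up for illustration. Each sublist is searched with a plain binary search.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HASH_LEN       32      /* raw SHA-256 digest size in bytes */
#define PREFIX_BUCKETS 65536   /* assumption: index on the first 2 bytes */

/* Hypothetical in-memory record: the digest plus the CSV row it came from. */
typedef struct {
    uint8_t  sha256[HASH_LEN];
    uint32_t row;
} record_t;

/* Sorted records plus, for each 2-byte prefix, where its sublist starts. */
typedef struct {
    record_t *recs;
    size_t    count;
    size_t    start[PREFIX_BUCKETS + 1];
} hash_index_t;

static int cmp_record(const void *a, const void *b)
{
    return memcmp(((const record_t *)a)->sha256,
                  ((const record_t *)b)->sha256, HASH_LEN);
}

/* First two bytes of the digest, read as a big-endian number. */
static size_t prefix_of(const uint8_t *hash)
{
    return ((size_t)hash[0] << 8) | hash[1];
}

/* One-time step: sort by hash, then record where each prefix bucket begins. */
void index_build(hash_index_t *ix, record_t *recs, size_t count)
{
    assert(ix != NULL && (recs != NULL || count == 0));
    qsort(recs, count, sizeof *recs, cmp_record);
    ix->recs  = recs;
    ix->count = count;

    size_t i = 0;
    for (size_t p = 0; p < PREFIX_BUCKETS; p++) {
        ix->start[p] = i;
        while (i < count && prefix_of(recs[i].sha256) == p)
            i++;
    }
    ix->start[PREFIX_BUCKETS] = count;
}

/* Lookup: jump to the bucket for the prefix, then binary-search inside it. */
const record_t *index_find(const hash_index_t *ix, const uint8_t *hash)
{
    size_t p  = prefix_of(hash);
    size_t lo = ix->start[p];
    size_t hi = ix->start[p + 1];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = memcmp(ix->recs[mid].sha256, hash, HASH_LEN);
        if (c == 0)
            return &ix->recs[mid];
        if (c < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;   /* hash not present */
}
```

The index struct is about 512 KB because of the bucket table, so I would keep it static or on the heap rather than on the stack. The part I am unsure about is whether the binary search inside each bucket could be replaced by something that takes advantage of how the hash values are distributed.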