
I have to frequently search hashes in a large (up to 1 GB) CSV database of the format

sha256_hash, md5_hash, sha1_hash, field1, field2, field3, etc.

in C. This needs to be very fast, and memory usage is a non-issue (32 GB minimum). I found this which is very close to what I had in mind: load the data into RAM, order the database by hash once, index by the first n bytes of the hash, and then search through the smaller sublists. But the thread above doesn't seem to address a question I have in mind. Since I'm not a cryptography guy, I was wondering about the distribution of hashes and whether or not it could be used to make searching the sublists even faster. Any suggestions about this or my general approach?


3 Answers


Yes, a bloom filter can be used to kick out 'definite negatives' early, by using the distribution of the hash bits.

http://en.wikipedia.org/wiki/Bloom_filter

To create a filter for a given bucket, bitwise OR all the hashes in that bucket together. To test a bucket, bitwise AND its filter with your target hash. Since ANDing can only clear bits, the result is always <= the target; if the result != the target hash (equivalently, result XOR target != 0), that bucket definitely does not contain the target hash and you can skip searching it. If the result == the target hash, the bucket MAY contain your target hash, and you need to search it to be sure. The filter can be cached and updated cheaply when new hashes are added (just OR in the new hash), but it has to be recomputed when hashes are removed. All that remains per bucket at search time is an AND and a compare, which are very cheap and will reduce your O(N) operation toward O(1) in the best case.

Care has to be taken with regard to bucket size so that the filters stay selective, because a filter with all bits set matches everything and is of no value to anyone.
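A minimal sketch of this per-bucket filter in C (the names bucket_filter and bucket_may_contain are mine, not from the answer; it assumes each bucket is an array of 32-byte SHA-256 hashes):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define HASH_LEN 32  /* bytes in a SHA-256 hash */

/* Build the filter by ORing together every hash in the bucket. */
void bucket_filter(const unsigned char (*hashes)[HASH_LEN], size_t n,
                   unsigned char filter[HASH_LEN])
{
    memset(filter, 0, HASH_LEN);
    for (size_t i = 0; i < n; i++)
        for (size_t b = 0; b < HASH_LEN; b++)
            filter[b] |= hashes[i][b];
}

/* AND the filter with the target: if any bit of the target is not
 * set in the filter, the bucket definitely lacks the target. */
bool bucket_may_contain(const unsigned char filter[HASH_LEN],
                        const unsigned char target[HASH_LEN])
{
    for (size_t b = 0; b < HASH_LEN; b++)
        if ((filter[b] & target[b]) != target[b])
            return false;  /* definite negative: skip this bucket */
    return true;  /* maybe present: search the bucket to confirm */
}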

  • Read the Wiki article, this is just what I'm looking for. Thanks. It looks like Murmur3 is the recommended hash to use with Bloom filters. Any comment on that? – user3715971 Jun 07 '14 at 18:41
  • You CAN use Murmur3, but at that point I fear you'll be reducing your bit-space even further. Since you're working with hashes to begin with, I don't see a need to rehash them again, I think you can use your source material for your filter. If, of course, that's not the case, and you need to hash your keys first, then Murmur3 is a great choice because you can adjust your keysize (32/128bit) to fit your expected distribution, and it's very cheap to compute. CityHash is also an excellent choice. – Travis Savo Jun 09 '14 at 19:10

This is a very easy problem to solve with a lot of memory. Make the hash the key of a hashtable. The key you provide to the table can be just the first N bytes of the hash (because they are so random that no human on earth can tell them apart from truly random data).

I'm not sure what you gain from keying the table with a prefix of the hash and keeping sublists. Any stock library-provided hashtable can easily solve your problem.
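For instance, a sketch using the POSIX hashtable from <search.h> (hsearch keys must be NUL-terminated strings, so this assumes the hashes are stored hex-encoded; error handling is minimal):

#include <search.h>
#include <stddef.h>

/* Build the table once from parsed CSV rows; hcreate tables cannot
 * grow, so oversize the slot count to keep probing short. */
int build_table(char **hex_sha256, char **rows, size_t n)
{
    if (!hcreate(n * 2))
        return -1;
    for (size_t i = 0; i < n; i++) {
        ENTRY e = { .key = hex_sha256[i], .data = rows[i] };
        if (!hsearch(e, ENTER))
            return -1;
    }
    return 0;
}

/* Look up a hex-encoded SHA-256; returns the CSV row or NULL. */
char *lookup(const char *hex_sha256)
{
    ENTRY query = { .key = (char *)hex_sha256 };
    ENTRY *found = hsearch(query, FIND);
    return found ? (char *)found->data : NULL;
}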

Or put it into any database and make the hash the primary key.
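As a sketch of the database route using SQLite's C API (the table layout below is made up for illustration):

#include <sqlite3.h>

/* Create the table once with the hash as primary key; SQLite
 * builds the lookup index for you. */
static const char *schema =
    "CREATE TABLE hashes ("
    "  sha256 BLOB PRIMARY KEY,"
    "  md5 BLOB, sha1 BLOB,"
    "  field1 TEXT, field2 TEXT, field3 TEXT);";

/* Look up one 32-byte SHA-256; returns 1 if a row was found. */
int find_row(sqlite3 *db, const unsigned char sha256[32])
{
    sqlite3_stmt *stmt;
    int found = 0;
    sqlite3_prepare_v2(db,
        "SELECT field1 FROM hashes WHERE sha256 = ?;", -1, &stmt, NULL);
    sqlite3_bind_blob(stmt, 1, sha256, 32, SQLITE_STATIC);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        found = 1;  /* sqlite3_column_text(stmt, 0) holds field1 */
    sqlite3_finalize(stmt);
    return found;
}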

usr
  • "Perfectly random" is the incorrect terminology: the term you are looking for is *uniform distribution*. – Dietrich Epp Jun 06 '14 at 18:02
  • @DietrichEpp you might be right. They are so random that no human on earth can tell them apart from truly random data. – usr Jun 06 '14 at 18:03
  • Either something is random or it is not random, and we don't know if these particular hashes are random or not. They do have a particular distribution, however. "Random" is the wrong term. – Dietrich Epp Jun 06 '14 at 18:08
  • @usr That's not what uniform distribution means and not actually the relevant property here. A random variable can have any of an infinite number of distributions (which describes how likely each value is). A random bit stream might output 1 far more often than 0, for example. Only a uniform distribution (all outputs equally likely), or something very similar, makes hash tables fast. Whether *humans* can notice that this is not "truly random" is of no interest whatsoever. –  Jun 06 '14 at 18:09
  • @delnan: I am a little confused reading your comment: "That's not what uniform distribution means"... what is "that" which the sentence refers to? – Dietrich Epp Jun 06 '14 at 18:20
  • @delnan humans not being able to notice any non-randomness means that we don't need to worry that we have tons of hash collisions. I anticipated the usual commentary that hashes are not guaranteed to be unique. That's true, but a hashtable will still work here. – usr Jun 06 '14 at 18:26
  • @DietrichEpp "uniform distribution" is not a sufficient property. The sequence 0,1,0,1,... is uniformly distributed, yet not random. (That said I don't care so much about the terms used here. We all know what I mean.) – usr Jun 06 '14 at 18:27
  • @usr: I'm talking about uniformity in the space 2^256, not over individual bits. – Dietrich Epp Jun 06 '14 at 18:32
  • @usr No commentary about hashes not being guaranteed to be unique. But hash collisions are quite separate from perceived randomness: Perfectly predictable hashes can have great collision behavior, and "perfectly random" hashes can have poor clustering behavior leading to bad (not actively terrible, but far worse than hoped) collision behavior with related inputs. I also maintain that "random" is the wrong moniker for what you describe - we now know what you mean, but it's still imprecise and will cause confusion in other contexts. –  Jun 06 '14 at 19:13

The distribution of hashes is uniform, which is helpful because you can put the hashes in a hash table.

#include <stdbool.h>

// something like this...
struct entry {
    bool used;                 /* slot occupied? (needed for probing) */
    unsigned char sha256[32];  /* the key */
    char field1[20];
    char field2[20];
};

If you don't need to delete entries from the hash table, just create a big array of struct entry and insert each record from the CSV at the index given by some bits of its SHA-256 hash. Use linear probing to insert entries: if slot i is taken, try i+1, then i+2, and so on until you find a free slot (a sketch of this insertion follows find_entry below).

#include <string.h>  /* memcmp, memcpy */

struct table {
    int nbits;              /* table has 1 << nbits slots */
    struct entry *entries;
};

/* Read the first 4 bytes of the hash as a little-endian integer. */
unsigned read_int(unsigned char *data)
{
    unsigned v = data[0] | (data[1] << 8) |
                 (data[2] << 16) | ((unsigned)data[3] << 24);
    return v;
}

struct entry *find_entry(struct table *table, unsigned char *sha256)
{
    unsigned index = read_int(sha256);
    unsigned mask = (1u << table->nbits) - 1;
    while (1) {
        struct entry *e = &table->entries[index & mask];
        if (!e->used)
            return NULL;   /* empty slot: the hash is not in the table */
        if (!memcmp(e->sha256, sha256, 32))
            return e;      /* full 32-byte compare confirms the match */
        index++;           /* linear probing: try the next slot */
    }
}
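The insertion step described above, as a sketch (insert_entry is my name, not from the answer; it assumes the entries array is zero-initialized and larger than the record count, so a free slot always exists):

/* Insert with linear probing: if the slot is taken, try the next
 * one until a free slot is found. */
struct entry *insert_entry(struct table *table, unsigned char *sha256)
{
    unsigned index = read_int(sha256);
    unsigned mask = (1u << table->nbits) - 1;
    while (1) {
        struct entry *e = &table->entries[index & mask];
        if (!e->used) {
            e->used = true;
            memcpy(e->sha256, sha256, 32);
            return e;   /* caller fills in field1/field2 */
        }
        index++;
    }
}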
Dietrich Epp
  • What hash function would you suggest? Just taking the whole SHA-256 hash modulo the table size (which is presumably a power of two)? –  Jun 06 '14 at 18:10
  • @delnan: Yes, just slice off bits from the SHA-256 hash. – Dietrich Epp Jun 06 '14 at 18:12