
Background

I have a large collection (~thousands) of sequences of integers. Each sequence has the following properties:

  1. it is of length 12;
  2. the order of the sequence elements does not matter;
  3. no element appears twice in the same sequence;
  4. all elements are smaller than about 300.

Note that properties 2 and 3 imply that the sequences are actually sets, but they are stored as C arrays in order to maximise access speed.

I'm looking for a good C++ algorithm to check whether a new sequence is already present in the collection. If it is not, the new sequence is added to the collection. I thought about using a hash table (note however that I cannot use any C++11 constructs or external libraries, e.g. Boost). Hashing the sequences and storing the values in a std::set is also an option, since collisions can simply be neglected if they are sufficiently rare. Any other suggestion is also welcome.

Question

I need a commutative hash function, i.e. a function that does not depend on the order of the elements in the sequence. I thought about first reducing the sequences to some canonical form (e.g. sorting) and then using standard hash functions (see refs. below), but I would prefer to avoid the overhead associated with copying (I can't modify the original sequences) and sorting. As far as I can tell, none of the functions referenced below are commutative. Ideally, the hash function should also take advantage of the fact that elements never repeat. Speed is crucial.

Any suggestions?

Arek' Fu
  • Sort the sequence and use `boost::hash_combine` on the individual hashes. – Kerrek SB Oct 11 '12 at 13:45
  • Why do Huffman codes suddenly spring to my head? Total compression-lib flashback. – WhozCraig Oct 11 '12 at 13:45
  • @KerrekSB, I can't use any external libraries or C++11 constructs. – Arek' Fu Oct 11 '12 at 13:48
  • What are the properties of the integers? I'm just wondering if they are spread out relatively evenly whether a quick and dirty option would be to just bitwise-xor the numbers all together to create another int that can be compared. Should keep collisions to a minimum and not require sorting but if all your numbers are 1 to 100 or something then you're likely to get a lot of collisions... – Chris Oct 11 '12 at 13:48
  • 1
    @Arek'Fu: Look at the code. `boost::hash_combine` is five lines long. Just copy it. – Kerrek SB Oct 11 '12 at 13:49
  • @Chris, did you mean bitwise-xor? (cool idea BTW!) – gilad hoch Oct 11 '12 at 13:52
  • @Chris, they are all smaller than about 300, so that would not work. – Arek' Fu Oct 11 '12 at 13:52
  • @giladhoch: yeah. Sorry, I always forget there are two types of or and always use xor in my head. :) The idea is that any commutative operation on those numbers can give an int that can be used for uniqueness, and with a reasonable spread of numbers the result should be evenly spread throughout the integer range, so with thousands of integers the chance of a collision is slim. Just adding them would work too but then you have to do the overflow work and other stuff. XOR is a lovely function. :) – Chris Oct 11 '12 at 13:54
  • @Arek'Fu: Ah, I did wonder. When talking about hashing and such like the constraints on the source input are always very important. You could try looking at some other functions like multiplying the numbers (with suitable overflow handling) and checking what the collisions are like after that. Predicting collision chances with that are too hard for me to do in my head though. :) Essentially it boils down to the fact that a hash function is just a function on your 12 numbers that produces another value that should be as unique as possible. – Chris Oct 11 '12 at 13:57
  • @Chris, you are right about the importance of the source. Question edited. – Arek' Fu Oct 11 '12 at 13:59
  • Just out of curiosity, how do you propose checking to see if the sequences match? It seems to me that you would be much better off sorting the sequences to make the hash table keys. That's got to be less overhead than sorting sequences every time you want to do a key comparison. – rici Oct 11 '12 at 14:00
  • Hah, you just edited in a *vital piece of information* on the restricted range. That makes all the difference. I'll update my answer. – Kerrek SB Oct 11 '12 at 14:06
  • @rici, I would store the sorted sequences in a `std::set`. I defined weak ordering between sequences using `std::memcmp`. – Arek' Fu Oct 11 '12 at 14:10
  • So my previous comment was deleted. Whatever, it was true: the refusal to use external libraries is a problem in C++, and generally not a reasonable requirement in a real-world project. The question itself is interesting and got my +1 but the refusal to use external libraries makes it seem as if the OP isn’t really interested in a constructive solution. – Konrad Rudolph Oct 11 '12 at 14:23
  • @KonradRudolph, I couldn't delete your comment even if I wanted to, because I don't have enough reputation. Someone else must have done it. My code is a small part of a 1-MSLOC project, and I'm not allowed to pull in new external dependencies. – Arek' Fu Oct 11 '12 at 14:28
  • @Arek'Fu Comments can only be deleted by moderators or flagging (by enough people). So I know it wasn’t you but I’m quite annoyed that people found it rude or whatever. In response to your comment: my sympathies. I do maintain that not having Boost as a dependency isn’t just a shortcoming, it’s a fatal flaw. I have worked on such projects in the past, it’s simply infuriating. – Konrad Rudolph Oct 11 '12 at 14:35

5 Answers


Here's a basic idea; feel free to modify it at will.

  1. Hashing an integer is just the identity.

  2. We use the formula from boost::hash_combine to combine hashes.

  3. We sort the array to get a unique representative.

Code:

#include <algorithm>
#include <cstddef>

std::size_t array_hash(int (&array)[12])
{
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);

    std::size_t result = 0;

    for (int * p = a; p != a + 12; ++p)
    {
        std::size_t const h = *p; // the "identity hash"

        result ^= h + 0x9e3779b9 + (result << 6) + (result >> 2);
    }

    return result;
}

Update: scratch that. You just edited the question to be something completely different.

If every number is at most 300, then you can squeeze the sorted array into 9 bits each, i.e. 108 bits. The "unordered" property only saves you a factor of 12!, which is about 29 bits, so it doesn't really make a difference.

You can either look for a 128-bit unsigned integral type and store the sorted, packed set of integers in it directly, or you can split that range into two 64-bit integers and compute the hash as above:

uint64_t hash = lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);

(Or maybe use 0x9E3779B97F4A7C15 as the magic number, which is the 64-bit version.)
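
For illustration, the packing could look something like this (just a sketch; I'm assuming `unsigned long long` is 64 bits, which most pre-C++11 compilers provide):

#include <algorithm>
#include <cstddef>

typedef unsigned long long u64; // assumed 64-bit

std::size_t packed_hash(int (&array)[12])
{
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);               // canonical (sorted) representative

    // Six sorted 9-bit values go into each half: 2 x 54 of the 108 bits.
    u64 lower_part = 0, upper_part = 0;
    for (int i = 0; i < 6; ++i)
    {
        lower_part = (lower_part << 9) | static_cast<u64>(a[i]);
        upper_part = (upper_part << 9) | static_cast<u64>(a[i + 6]);
    }

    u64 hash = lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);
    return static_cast<std::size_t>(hash);
}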

Kerrek SB
  • Sounds interesting. But will this algorithm yield few collisions? Wouldn't it be better to have similar integers hashed to different sequences? – Arek' Fu Oct 11 '12 at 13:57
  • @Arek'Fu: I'm afraid I don't understand your question. Twelve 32-bit integers have 2^384 possible values, divided by 12! = 479001600. That still leaves about 2^355 possible distinct sequences of integers. Good luck finding a collision-free mapping to a single 32-bit integer. – Kerrek SB Oct 11 '12 at 14:00
  • All the integers are smaller than about 300 (see edited question) and I only have a few thousand sequences anyway. The number of available hashes is definitely sufficient for that. I'm rather afraid that my sequences will be mapped to a small subset of the available hashes, since the entropy of the sequences is rather low (all elements are small, no permutations allowed...). – Arek' Fu Oct 11 '12 at 14:06
  • Oooh, elegant (the second part). Have an upboat. – Konrad Rudolph Oct 11 '12 at 14:13
  • @KerrekSB: That seems sensible but the OP does say "without sorting". It seems pretty trivial if you are allowed to sort (to my mind at least), have you any thoughts on doing it without sorting since that is the interesting part for me... – Chris Oct 11 '12 at 14:14
  • @Arek'Fu: Are the numbers determined statically? You can create a perfect hash function if you know all the values in advance. – Kerrek SB Oct 11 '12 at 14:14
  • @Chris: I think "without sorting" is an insane requirement. How else will you consider two different representations of an unordered sequence equivalent? Note that the "speed is crucial" requirement sounds like superstition to me: Sorting a *copy* of twelve integers, which are contiguous in memory (as in my first code example), should hardly take any time at all. – Kerrek SB Oct 11 '12 at 14:15
  • @KerrekSB, the values are generated on the fly. And from time to time the collection is wiped out and we restart from scratch. Sorting is not forbidden, but I was wondering if one could avoid it by using an intrinsically commutative hash function. For instance, XOR'ing has this property (yeah, yeah, I know it's crap). – Arek' Fu Oct 11 '12 at 14:19
  • 1
    @KerrekSB: You might be right about the not sorting being a silly requirement but from an academic point of view there are hash functions that would work here. So far all those suggested so far have all been fairly poor in terms of avoiding collisions but that would be how you tell if two unordered sequences are equivalent. I'm just wondering if there are any better ones than we've come up with so far. :) – Chris Oct 11 '12 at 14:24
  • @Chris Hash values can't tell you that two unordered sequences are equivalent, only unequivalent ... if the hash values match, the sequences still have to be sorted and then compared. – Jim Balter Mar 18 '14 at 23:31
  • @JimBalter: Bit of a zombie thread so I can't remember exactly what I was meaning but on rereading the question is about finding a hash function to help find equality so I suspect I was just using shorthand. You are totally right in what you say and I assume that was implicit. My point was I suspect that you could create a function that would allow you to determine if things were different without need of ordering. – Chris Mar 19 '14 at 09:33

I would just use the sum function as the hash and see how far you get with that. This doesn’t take advantage of the non-repeating property of the data, nor of the fact that the values are all < 300. On the other hand, it’s blazingly fast.

#include <numeric>
#include <cstddef>

std::size_t hash(int (&arr)[12]) {
    return std::accumulate(arr, arr + 12, 0);
}

Since the function needs to be unaware of ordering, I don’t see a smart way of taking advantage of the limited range of the input values without first sorting them. If this is absolutely required, collision-wise, I’d hard-code a sorting network (i.e. a series of compare-and-swap statements, as sketched below) to sort the 12 values in place (but I have no idea what a sorting network for 12 values would look like, or even whether it’s practical).
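
For illustration only, the building block is a hard-coded compare-and-swap; here is what such a network looks like for four values (a 12-value network would be the same idea with more stages):

#include <algorithm> // std::swap

inline void cmp_swap(int& a, int& b) { if (b < a) std::swap(a, b); }

// Optimal 5-comparator sorting network for 4 values; a 12-value
// network would simply chain more such stages.
void sort4(int& a, int& b, int& c, int& d) {
    cmp_swap(a, b); cmp_swap(c, d);
    cmp_swap(a, c); cmp_swap(b, d);
    cmp_swap(b, c);
}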

EDIT After the discussion in the comments, here’s a very nice way of reducing collisions: raise every value in the array to some integer power before summing. The easiest way of doing this is via std::transform. This does generate a copy, but that’s probably still very fast:

#include <algorithm>
#include <numeric>
#include <cstddef>

struct pow2 {
    int operator ()(int n) const { return n * n; }
};

std::size_t hash(int (&arr)[12]) {
    int raised[12];
    std::transform(arr, arr + 12, raised, pow2());
    return std::accumulate(raised, raised + 12, 0);
}
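
A quick sanity check of the order-independence, assuming the squared version above (the array values here are made up):

#include <iostream>

int main() {
    int a[12] = {0, 17, 42, 101, 299, 5, 68, 230, 11, 154, 87, 200};
    int b[12] = {299, 0, 5, 17, 42, 68, 87, 101, 11, 154, 200, 230}; // a permutation of a
    std::cout << (hash(a) == hash(b) ? "equal" : "different") << '\n'; // prints "equal"
}
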
Konrad Rudolph
  • 1
    assuming the numbers are randomly distributed between 0 and 300 I would have thought simple addition would run the risk of having a lot of collisions (or at least proportionally high) around the expected sum (ie 150*12=1800ish). – Chris Oct 11 '12 at 14:10
  • @Chris Yes, definitely. But short of sorting (which will be *much* slower than my method) I don’t see a way around that (and your suggestion of using xor instead of addition works of course equally well, but has the same weakness). – Konrad Rudolph Oct 11 '12 at 14:10
  • 1
    You can reduce collisions via "spreading out" the hashes a bit. Multiply each value by itself once, twice or even three times to utilize more of the 32-bit range. – Deestan Oct 11 '12 at 14:15
  • Yeah, that was said before the upper limit of 300 at which point the collisions became not just obvious but inevitable. I'm wondering about multiplication since errors compound a lot less than with addition but I'm not a C++ programmer (I just like algorithms) so I'm not sure how efficient multiplying them together and doing mod 2^32 at each step would be. If it just truncates on overflow then that should be pretty effective and I'd suspect have relatively few collisions while still being pretty cheap? – Chris Oct 11 '12 at 14:17
  • @Deestan Not sure what you mean. `arr * arr` (etc.) is invalid syntax. – Konrad Rudolph Oct 11 '12 at 14:17
  • 2
    @Chris Multiplication is no good since `0` is an allowed value (i.e. this method would lead to tons of collisions at 0). – Konrad Rudolph Oct 11 '12 at 14:18
  • Sorry about the syntax. I intended to convey "each value to the power of 2, 3 or 4". Corrected my comment. – Deestan Oct 11 '12 at 14:18
  • @KonradRudolph: oops, yes. :) – Chris Oct 11 '12 at 14:20
  • @Deestan I don’t think that would change the number of different hash values. – Konrad Rudolph Oct 11 '12 at 14:21
  • @KonradRudolph Abstractly put, the hash sum now spans a larger range. As an example, consider [2, 5] and [3, 4]. These collide when summed, but their summed squares won't. Some new collisions are introduced, of course, but even more values are "moved out" of the initial range of about 3522. – Deestan Oct 11 '12 at 14:24
  • Yeah, it would certainly reduce the chances of collision. I think you'd probably need some statistical testing before you could say how effectively it would do so. :) – Chris Oct 11 '12 at 14:26
  • @Deestan The range of the span is larger, true. But I think the number of distinct hash values will remain the same, no? I may be mistaken though. Hmm, if I had the time I’d just sample some values to test that. – Konrad Rudolph Oct 11 '12 at 14:26
  • 1
    If you think about a set that may contain the numbers 0 to 3 and only two of them then the addition hash will give you a number in the range 0 to 6 (assuming duplicates are allowed) which is non-unique (obviously). Summing their square will give 10 values (0,1,2,4,5,8,9,10,13,18) which in this case is a unique hash I believe. The more numbers you are adding the more likely a collision is and the higher you raise the power the more you reduce collisions I believe. – Chris Oct 11 '12 at 14:32
  • 2
    They key thing is that with basic addition then increasing one number by one and reducing another by one will give the same hash. If you are squaring the numbers then this is no longer true. – Chris Oct 11 '12 at 14:33
  • @KonradRudolph I did a full pen-and-paper calculation of all combinations of 3 numbers from a set of 5. Collisions were reduced from 3 to 0. – Deestan Oct 11 '12 at 14:35
  • @Chris True. I just wrote a quick R script to test this empirically but it’s logical anyway. – Konrad Rudolph Oct 11 '12 at 14:38
  • For what it's worth I wrote a quick c# program to generate random sets of 12 numbers with values 0 to 300 (duplicates allowed) and did the sum of the numbers, squares, third and fourth powers. With the sum the average over a few runs was about 35 values before a collision. With squares it was looking at about 800. Cubes was of the order of about 12,000 and 4th powers around 50,000 values. So for the case of looking at sets in the thousands cubes is probably the way to go with a minimal chance of collisions. Ah... the mathematician in me is satisfied at last. :) – Chris Oct 11 '12 at 15:07
  • I tried this solution and it is quite slow. It probably does too much math under the hood (multiplications!). – Arek' Fu Oct 11 '12 at 16:50
  • @Arek'Fu Hmm, that’s what I suspected, to be honest. Have you tried the first solution as well? I think Kerrek’s second solution is probably going to beat it but maybe not … – Konrad Rudolph Oct 11 '12 at 17:13
  • 2
    If computations are vectorized, this should be fast enough. If not, probably it is worth to substitute squaring with table lookup (and pre-compute a 300-values table of squares/cubes/pseudo-random values). – Evgeny Kluev Oct 11 '12 at 18:17
  • The comments here are really missing the boat ... only Evgeny mentions table lookup, and only Evgeny mentions random values, but he still doesn't know what property those random values should have ... see (the tail end of) my answer. – Jim Balter Oct 11 '12 at 19:34
  • After some more testing and benchmarking, it turns out that this strategy is quite efficient if combined with a look-up table for the int-to-hash mapping. See my comment to Jim Balter's answer. – Arek' Fu Oct 12 '12 at 17:25

You could toggle the bit corresponding to each of the 12 integers in a bitset of size 300, then use the formula from boost::hash_combine to combine the ten 32-bit integers implementing this bitset.

This gives a commutative hash function, does not use sorting, and takes advantage of the fact that elements never repeat.
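
A sketch of this idea, assuming a 32-bit `unsigned int` (the function name is arbitrary):

#include <cstddef>

std::size_t bitset_hash(const int (&seq)[12])
{
    unsigned int bits[10] = { 0 };  // 10 x 32 = 320 bits, enough for values < 300

    for (int i = 0; i < 12; ++i)
        bits[seq[i] / 32] ^= 1u << (seq[i] % 32); // toggle; safe since elements never repeat

    std::size_t result = 0;
    for (int w = 0; w < 10; ++w)    // combine the ten words, boost::hash_combine style
        result ^= bits[w] + 0x9e3779b9 + (result << 6) + (result >> 2);
    return result;
}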


This approach may be generalized by choosing an arbitrary bitset size and setting or toggling an arbitrary number of bits for each of the 12 integers (which bits to set/toggle for each of the 300 values is determined either by a hash function or by a pre-computed lookup table). This results in a Bloom filter or related structures.

We can choose a Bloom filter of size 32 or 64 bits. In this case there is no need to combine pieces of a large bit vector into a single hash value. For a classical Bloom filter of size 32, the optimal number of hash functions (or of non-zero bits for each value of the lookup table) is 2.

If, instead of "or" operation of classical Bloom filter, we choose "xor" and use half non-zero bits for each value of the lookup table, we get a solution, mentioned by Jim Balter.

If, instead of "or" operation, we choose "+" and use approximately half non-zero bits for each value of the lookup table, we get a solution, similar to one, suggested by Konrad Rudolph.

Evgeny Kluev
  • I'm not sure I understand the second part of your answer. Are you suggesting to use a 32-bit Bloom filter per sequence, and combine them using hash_combine? – Arek' Fu Oct 12 '12 at 17:28
  • @Arek'Fu: No, after using 32-bit Bloom filter per sequence there is nothing to combine, we already have a single 32-bit hash value. And I'm not suggesting any particular scheme. I just enumerate several possibilities to construct a hash function, satisfying your requirements (bitset of size 32 .. 300, different ways to set/toggle its bits, and using hash_combine only if bitset is larger than required hash size). As for which variant to choose, 64-bit or 32-bit bitset seems to be the fastest and "xor", "+" variants are, probably, better, than "or". – Evgeny Kluev Oct 12 '12 at 17:43

Sort the elements of your sequences numerically and then store the sequences in a trie. Each level of the trie is a data structure in which you search for the element at that level ... you can use different data structures depending on how many elements are in it ... e.g., a linked list, a binary search tree, or a sorted vector.
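
A minimal sketch of such a trie, with `std::map` as the per-level structure (names are illustrative; deallocation is omitted for brevity, and since all sequences have length 12 no prefix handling is needed):

#include <map>
#include <utility>

struct TrieNode {
    std::map<int, TrieNode*> children; // per-level search structure
};

// Inserts a sorted sequence of length n; returns true if it was new.
bool insert_sorted(TrieNode* node, const int* seq, int n)
{
    bool isNew = false;
    for (int i = 0; i < n; ++i) {
        std::map<int, TrieNode*>::iterator it = node->children.find(seq[i]);
        if (it == node->children.end()) {
            isNew = true;
            it = node->children.insert(std::make_pair(seq[i], new TrieNode)).first;
        }
        node = it->second;
    }
    return isNew;
}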

If you want to use a hash table rather than a trie, then you can still sort the elements numerically and then apply one of those non-commutative hash functions. You need to sort the elements in order to compare the sequences, which you must do because you will have hash table collisions. If you didn't need to sort, then you could multiply each element by a constant factor that would smear them across the bits of an int (there's theory for finding such a factor, but you can find it experimentally), and then XOR the results. Or you could look up your ~300 values in a table, mapping them to unique values that mix well via XOR (each one could be a random value chosen so that it has an equal number of 0 and 1 bits -- each XOR flips a random half of the bits, which is optimal).
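
The multiplicative variant could be sketched like this (2654435761 is Knuth's 32-bit multiplicative hashing constant, just one plausible choice of factor):

#include <cstddef>

std::size_t smear_hash(const int* seq, int n)
{
    std::size_t h = 0;
    for (int i = 0; i < n; ++i)
        h ^= seq[i] * 2654435761u; // smear each element across the bits, then XOR
    return h;                      // XOR makes the result order-independent
}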

Jim Balter
  • I've spent some time trying to implement your second suggestion today, and I think it is the most promising so far. I've constructed 300 random 64-bit strings, with an equal number of 0 and 1 bits each. I tried to mix the mapped values using XOR and using sum -- both strategies give very similar (and very good) performances and collision rates. – Arek' Fu Oct 12 '12 at 17:16
  • I Googled here and there a bit and I'm left with the impression that a trie would be overkill given the number of sequences that I need to handle. AFAIU, tries outperform hash tables for large sets. The number of my sequences varies a lot -- sometimes it's as few as 10, but it can occasionally go up to 10^6. Could you suggest an existing simple C++ trie implementation? If I can get something simple running, that would give me an idea of the possible performance gain. – Arek' Fu Oct 12 '12 at 17:24
  • Surprisingly (for me), using 32-bit integers yields very similar collision rates and slightly poorer (!) performance. – Arek' Fu Oct 12 '12 at 17:31
  • @Arek'Fu You're probably right that a trie is overkill for a relatively small number of sequences, when the hash table collision rate is low. I can't suggest an implementation beyond what google yields, e.g., http://stackoverflow.com/questions/1036504/trie-implementation – Jim Balter Oct 12 '12 at 18:32
  • I decided to accept this answer because it came the closest to the algorithm that I ended up with, which is posted as a [separate answer](http://stackoverflow.com/a/12900269/672098). – Arek' Fu Oct 15 '12 at 16:53

I accepted Jim Balter's answer because he's the one who came closest to what I eventually coded, but all of the answers got my +1 for their helpfulness.

Here is the algorithm I ended up with. I wrote a small Python script that generates 300 64-bit integers such that their binary representation contains exactly 32 true and 32 false bits. The positions of the true bits are randomly distributed.

import itertools
import random
import sys

def random_combination(iterable, r):
    "Random selection from itertools.combinations(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(xrange(n), r))
    return tuple(pool[i] for i in indices)

mask_size = 64
mask_size_over_2 = mask_size/2

nmasks = 300

suffix = 'UL'

print 'HashType mask[' + str(nmasks) + '] = {'
for i in range(nmasks):
    combo = random_combination(xrange(mask_size), mask_size_over_2)
    mask = 0
    for j in combo:
        mask |= (1 << j)
    if i < nmasks - 1:
        print '\t' + str(mask) + suffix + ','
    else:
        print '\t' + str(mask) + suffix + ' };'

The C++ array generated by the script is used as follows:

#include <stdint.h>  // int_least64_t (C99 header, usually available pre-C++11)
#include <numeric>   // std::accumulate

typedef int_least64_t HashType;

const int maxTableSize = 300;

HashType mask[maxTableSize] = {
  // generated array goes here
};

inline HashType xorrer(HashType const &l, HashType const &r) {
  return l^mask[r];
}

HashType hashConfig(HashType *sequence, int n) {
  return std::accumulate(sequence, sequence+n, (HashType)0, xorrer);
}

This algorithm is by far the fastest of those that I have tried (Kerrek SB's sort-and-combine, Konrad Rudolph's sum with cubes, and Evgeny Kluev's with a bitset of size 300). For my "typical" sequences of integers, collision rates are smaller than 1E-7, which is completely acceptable for my purpose.
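
For completeness, here is a minimal sketch of how the hash plugs into the duplicate check from the question (the wrapper name is made up):

#include <set>

std::set<HashType> knownHashes;

// Returns true if the sequence was not seen before (and records its hash).
bool addIfNew(HashType *sequence, int n) {
  return knownHashes.insert(hashConfig(sequence, n)).second;
}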

Arek' Fu