0

I have structured data like below:

struct Leg
{
char type;
char side;
int qty;
int id;
} Legs[5];

where

type is O or E, 
side is B or S;
qty is 1 to 9999 and qty in all Legs is relative prime to each other i.e. 1 2 3 not 2 4 6
id is an integer from 1 to 9999999 and all ids are unique in the group of Legs

To build unique signature of above data, currently I am building a string like below: first sort Legs based on id; then

signature=""
for i=1 to 5
signature+=id+type+qty+side of leg-i

and I insert into unordered_map so that if any matching structured data comes, I can, lookup by building a signature as above and looking up.

unorderd_map on string means key-compare which is string compare and also hash function which needs to traverse the string which is usually around 25 chars.

For efficiency, it it is possible to build a unique integer out of above data for each structure above, the lookups/insertions in unorderd_map will be extremely faster.

Just wondering if there is any mathematical properties I can take advantage of.

Edit: The map will contain key,value pairs like

<unique-signature=key, value=int-value needs to be located on looking up another repeating Leg group by constructing signature like above after sorting Legs based on id>
<123O2B234E3S456O3S567O2S789E2B, 989>

The goal is to build unique signature from each such unique repeating group of legs. Legs can be in different order and yet they can be match with another group of legs which are in different order thats why I sort based on id which is unique and build the signature.

My signature is string based, if there was a way to construct a unique number signature, then my lookups/insertions will be faster.

Medicine
  • 1,923
  • 2
  • 23
  • 33

3 Answers3

3

You can just create a unique 40-bit number from the fields you have. Why 40 bits? I'm glad you asked.

You have 9,999,999 possible id values, which means you can use 24 bits to represent all possibilities (log2(9999999) = a little over 23).

You have 9,999 possible qty values, which requires another 14 bits.

type and side require 1 bit each, which gives you a total of 40 bits of information. Store this number as a long long and you have a nice, fast key for your map.

If you really want a unique int key then you're probably out of luck because it's going to be pretty tricky to get rid of 8 bits of information. You might be able to take advantage of the co-primality of the qty field to represent it in fewer than 14 bits, however I doubt that you can get it down to 6 bits because that only gives you 64 possible values for qty.

That's a way to get what you asked for, but @David Schwartz's answer is probably what you actually need: hash collisions are generally not expensive unless you have a really bad hash function - see Application vulnerability due to Non Random Hash Functions for an example of how that can bite you - or a carefully crafted data set that happens to hit the worst-case.

In your case you should be fine with David's answer. It'll be fast enough unless you are extremely unfortunate with your set of data.

EDIT: Just noticed that you are computing your signature over the set of 5 Legs. The same math applies, you just will need 200 bits rather than 4. So it won't fit in a long long unless you have some information that can be shared amongst all 5 Leg objects; if each set of 5 shares the same id, for example.

Stick with David's answer.

Community
  • 1
  • 1
Cameron Skinner
  • 51,692
  • 2
  • 65
  • 86
  • May be I was not clear in my question 1.The signature needs to be generated from all the Legs. 2.I need to store an int value for each such unique repeating group of Legs(Legs can be in different order) in the hash table, The idea is to look up that value, so the key needs to be unique as when collision happens, I need to find the value of that key which I believe is found using a KeyEqual function. 3.The way i generate a string signature is always unique but the problem is of efficiency, dynamic allocation and string compares so need a way to generate unique integer out of that Legs group. – Medicine Oct 22 '13 at 02:29
  • Unfortunately each Leg in the repeating group could have different values and unpredictable – Medicine Oct 22 '13 at 02:34
2

It doesn't have to be unique. I would suggest something like:

std::size_t hash_value(const Leg& l)
{
    std::size_t ret = l.type;
    ret << = 8;
    ret |= l.side;
    ret *= 2654435761;
    ret += l.qty;
    ret *= 2654435761;
    ret += l.id;
    return ret * 2654435761;
}
David Schwartz
  • 179,497
  • 17
  • 214
  • 278
  • May be I was not clear in my question 1.The signature needs to be generated from all the Legs. 2.I need to store an int value for each such unique repeating group of Legs(Legs can be in different order) in the hash table, The idea is to look up that value, so the key needs to be unique as when collision happens, I need to find the value of that key which I believe is found using a KeyEqual function. 3.The way i generate a string signature is always unique but the problem is of efficiency, dynamic allocation and string compares so need a way to generate unique integer out of that Legs group. – Medicine Oct 22 '13 at 02:29
  • What is the name of this hash function? Just interested. – Simple Oct 22 '13 at 11:11
  • 2
    Knuth's multiplicative hash. – David Schwartz Oct 22 '13 at 11:21
1

In order to create an order-independent hash function for groups of five legs, first choose a hash function for individual legs -- David's answer looks great. Compute the hashes for each of the five legs. Now choose an order-independent function to combine these five hash values. You could, for example, xor the hashes together, or add them all together, or multiply them all together.

The fact that multiplication distributes over addition, and multiplication was the last operation to happen, makes me a little bit wary of using that. I think xor might be the best option of the ones I give here; but before using this in production, you should definitely run a few tests to see if you can easily generate collisions with any of them.

Probably superfluous, but here is a simple implementation that calls hash_value from David's answer:

std::size_t hash_value(const Leg_Array& legs) {
    std::size_t ret = 0;
    for (int i = 0; i < 5; ++i) {
        ret ^= hash_value(legs[i]);
    }
    return ret;
}
Community
  • 1
  • 1
Erik P.
  • 1,577
  • 10
  • 18
  • Sorry but I need to generate a unique number from the set of legs, i.e. for each set of legs the hash_value needs to be unique, there should not be collisions as I intend to store this computed hash as a key and an associated integer value in a map of type . Each unique set of legs is associated with an int value. so instead of storing a map of which is costly for key compare and Legs taking lot of space, I would like to store by calculating unique integer from Legs which needs to be unique for each different set of Legs. – Medicine Oct 22 '13 at 23:27
  • 2
    Impossible: as Cameron shows above, there are roughly 40 bits of information in a Leg, so roughly 200 in a tuple of 5 legs. Any mapping to a, say, 64 bit integer is going to have collisions. You could, in theory, cook up a scheme to fit the data into a BigInt or an array of integers that is at least 200 bits wide... but I'm not sure that would work particularly well. – Erik P. Oct 23 '13 at 12:09