8

I have an integer type, say long, whose values are between Long.MIN_VALUE = 0x80...0 (-2^63) and Long.MAX_VALUE = 0x7f...f (2^63 - 1). I want to hash it with ~50% collision to a positive integer of the same type (i.e. between 1 and Long.MAX_VALUE) in a clean and efficient manner.

My first attempts were something like:

  • Math.abs(x) + 1
  • (x & Long.MAX_VALUE) + 1

but those and similar approaches always have problems with certain values, i.e. when x is 0 / Long.MIN_VALUE / Long.MAX_VALUE. Of course, the naive solution is to use 2 if statements, but I'm looking for something cleaner / shorter / faster. Any ideas?
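To make the failing cases concrete, here is a small sketch that evaluates both attempts at the boundary values. The class and method names are made up for illustration only.

```java
final class NaiveAttempts {
    // First attempt: Math.abs(Long.MIN_VALUE) overflows and stays negative,
    // and Math.abs(Long.MAX_VALUE) + 1 wraps around to Long.MIN_VALUE.
    static long absPlusOne(long x) {
        return Math.abs(x) + 1;
    }

    // Second attempt: clearing the sign bit can leave Long.MAX_VALUE,
    // and adding 1 to that wraps around to Long.MIN_VALUE.
    static long maskPlusOne(long x) {
        return (x & Long.MAX_VALUE) + 1;
    }

    public static void main(String[] args) {
        long[] edgeCases = {0L, -1L, Long.MIN_VALUE, Long.MAX_VALUE};
        for (long x : edgeCases) {
            System.out.println(x + " -> abs+1: " + absPlusOne(x)
                    + ", mask+1: " + maskPlusOne(x));
        }
    }
}
```

Both variants return a negative number for at least one of these inputs, which is exactly the problem described above.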

Note: Assume that I'm working in Java, where there is no implicit conversion to boolean and shift semantics are well defined.

eold

9 Answers

10

The simplest approach is to zero the sign bit and then map zero to some other value:

long y = x & Long.MAX_VALUE;  // clear the sign bit
return (y == 0) ? 42 : y;     // map zero to an arbitrary positive value

This is simple, uses only one if/ternary operator, and gives ~50% collision rate on average. There is one disadvantage: it maps 4 different values (0, 42, MIN_VALUE, MIN_VALUE+42) to one value (42). So for this value we have 75% collisions, while for other values - exactly 50%.
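The four-way collision on 42 is easy to confirm. The following is only a sketch with illustrative names, wrapping the two lines above in a method:

```java
final class MaskThenFixZero {
    // Clear the sign bit, then replace a zero result with the constant 42.
    static long hash(long x) {
        long y = x & Long.MAX_VALUE;
        return (y == 0) ? 42 : y;
    }

    public static void main(String[] args) {
        long[] inputs = {0L, 42L, Long.MIN_VALUE, Long.MIN_VALUE + 42};
        for (long x : inputs) {
            System.out.println(x + " -> " + hash(x)); // each of these four inputs yields 42
        }
    }
}
```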

It may be preferable to distribute collisions more evenly:

return (x == 0)? 42: (x == Long.MIN_VALUE) ? 142: x & Long.MAX_VALUE;

This code gives 67% collisions for 2 values and 50% for other values. You cannot distribute collisions more evenly, but it is possible to choose these 2 most colliding values. The disadvantage is that this code uses two ifs/ternary operators.

It is possible to avoid 75% collisions on a single value while using only one if/ternary operator:

long y = x & Long.MAX_VALUE;
return (y == 0) ? 42 - (x >> 7) : y;

This code gives 67% collisions for 2 values and 50% collisions for other values. There is less freedom choosing these most colliding values: 0 maps to 42 (and you can choose almost any value instead); MIN_VALUE maps to 42 - (MIN_VALUE >> 7) (and you can shift MIN_VALUE by any value from 1 to 63, only make sure that A - (MIN_VALUE >> B) does not overflow).


It is possible to get the same result (67% collisions for 2 values and 50% collisions for other values) without conditional operators (but with more complicated code):

long y = x - 1 - ((x >> 63) << 1);  // x - 1 for x >= 0, x + 1 for x < 0
long z = y + 1 + (y >> 63);         // y + 1 for y >= 0, y for y < 0
return z & Long.MAX_VALUE;

This gives 67% collisions for values '1' and 'MAX_VALUE'. If it is more convenient to get most collisions for some other values, just apply this algorithm to x + A, where 'A' is any number.
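Since the branch-free version is harder to read, a quick check of the boundary inputs may help. This is a sketch with an illustrative class name, not a proof:

```java
final class BranchFreeHash {
    static long hash(long x) {
        long y = x - 1 - ((x >> 63) << 1); // x - 1 for x >= 0, x + 1 for x < 0
        long z = y + 1 + (y >> 63);        // y + 1 for y >= 0, y for y < 0
        return z & Long.MAX_VALUE;         // clear the sign bit
    }

    public static void main(String[] args) {
        long[] inputs = {1L, -1L, Long.MIN_VALUE, 0L, -2L, Long.MAX_VALUE};
        for (long x : inputs) {
            System.out.println(x + " -> " + hash(x));
        }
    }
}
```

The inputs 1, -1 and MIN_VALUE all land on 1, while 0, -2 and MAX_VALUE all land on MAX_VALUE.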

An improved variant of this solution:

long y = x + 1 + ((x >> 63) << 1);  // x + 1 for x >= 0, x - 1 for x < 0
long z = y - (y >> 63);             // y for y >= 0, y + 1 for y < 0
return z & Long.MAX_VALUE;
Evgeny Kluev
3

Assuming you want to collapse all values into the positive space, why not just zero the sign bit?

You can do this with a single bitwise op by taking advantage of the fact that MAX_VALUE is just a zero sign bit followed by ones e.g.

int positive = value & Integer.MAX_VALUE;

Or for longs:

long positive = value & Long.MAX_VALUE;

If you want a "better" hash with pseudo-random qualities, you probably want to pass the value through another hash function first. My favourite fast hashes are the XORshift family by George Marsaglia. These have the nice property that they map the entire int / long number space perfectly onto itself, so you will still get exactly 50% collisions after zeroing the sign bit.

Here's a quick XORshift implementation in Java:

public static final long xorShift64(long a) {
    a ^= (a << 21);
    a ^= (a >>> 35);
    a ^= (a << 4);
    return a;
}

public static final int xorShift32(int a) {
    a ^= (a << 13);
    a ^= (a >>> 17);
    a ^= (a << 5);
    return a;
}
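Putting the two steps together for the original question might look like the sketch below. The zero handling (mapping a zero result to 1) is an assumption added here to satisfy the "strictly positive" requirement; it is not part of the XORshift function itself, and the class and method names are illustrative.

```java
final class XorShiftPositive {
    static long xorShift64(long a) {
        a ^= (a << 21);
        a ^= (a >>> 35);
        a ^= (a << 4);
        return a;
    }

    // Scramble, clear the sign bit, then replace a zero result with 1.
    // The replacement value 1 is an arbitrary choice for this sketch.
    static long positiveHash(long x) {
        long h = xorShift64(x) & Long.MAX_VALUE;
        return (h == 0) ? 1 : h;
    }

    public static void main(String[] args) {
        long[] inputs = {0L, 1L, -1L, Long.MIN_VALUE, Long.MAX_VALUE};
        for (long x : inputs) {
            System.out.println(x + " -> " + positiveHash(x));
        }
    }
}
```

The zero fix-up costs a few extra collisions on the single value 1; everything else keeps the 2-to-1 mapping.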
mikera
1

From the information theoretic view, you have 2^64 values to map into 2^63-1 values.

As such, mapping is trivial with a modulus operation that always yields a non-negative result. In Java the % operator takes the sign of the dividend, so Math.floorMod (Java 8 and later) is needed here:

y = 1 + Math.floorMod(x, 0x7fffffffffffffffL);  // the constant is 2^63-1

This could be pretty expensive, so what else is possible?

The simple math 2^64 = 2 * (2^63 - 1) + 2 says we will have two source values mapping to one target value except in two special cases, where three will go to one. Think of these as two special 64-bit values, call them x1 and x2, that each share a target with two other source values. In the mod expression above, this occurs by "wrapping". The target values y = 1 and y = 2^63 - 1 have three mappings. All others have two. Since we have to use something more complex than mod anyway, let's seek a way to map the special values wherever we like at low cost.
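The counting argument can be checked by brute force on a scaled-down space. The sketch below assumes a 4-bit signed range [-8..7] with modulus 7 in place of 2^63 - 1, and uses Math.floorMod (Java 8 and later) so that the remainder is never negative. The class name is illustrative.

```java
final class ModCountDemo {
    // Preimage counts for y = 1 + floorMod(x, 7) over the range [-8..7].
    // counts[y] is the number of x values that map to y; index 0 stays unused.
    static int[] preimageCounts() {
        int[] counts = new int[8];
        for (int x = -8; x <= 7; x++) {
            int y = 1 + Math.floorMod(x, 7);
            counts[y]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = preimageCounts();
        for (int y = 1; y <= 7; y++) {
            System.out.println("y=" + y + " has " + counts[y] + " source values");
        }
    }
}
```

In this small space, exactly two targets receive three source values and the remaining five receive two, matching 16 = 2 * 7 + 2.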

For illustration let's work with mapping a 4-bit signed int x in [-8..7] to y in [1..7], rather than the 64-bit space.

An easy course is to have x values in [1..7] map to themselves, then the problem reduces to mapping x in [-8..0] to y in [1..7]. Note there are 9 source values here and only 7 targets as discussed above.

There are obviously many strategies. At this point you can probably see a gazillion. I'll describe only one that's particularly simple.

Let y = 1 - x for all values except special cases x1 == -8 and x2 == -7. The whole hash function thus becomes

y = x <= -7 ? S(x) : x <= 0 ? 1 - x : x;

Here S(x) is a simple function that says where x1 and x2 are mapped. Choose S based on what you know about the data. For example if you think high target values are unlikely, map them to 6 and 7 with S(x) = -1 - x.

The final mapping is:

-8: 7    -7: 6    -6: 7    -5: 6    -4: 5    -3: 4    -2: 3    -1: 2
 0: 1     1: 1     2: 2     3: 3     4: 4     5: 5     6: 6     7: 7
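The table can be reproduced mechanically. This is a minimal sketch of the 4-bit case with S(x) = -1 - x, using an illustrative class name:

```java
final class FourBitMapping {
    // S(x) = -1 - x places the two special inputs -8 and -7 on targets 7 and 6.
    static int map(int x) {
        return x <= -7 ? -1 - x : x <= 0 ? 1 - x : x;
    }

    public static void main(String[] args) {
        for (int x = -8; x <= 7; x++) {
            System.out.println(x + ": " + map(x));
        }
    }
}
```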

Taking this logic up to the 64-bit space, you'd have

y = (x <= Long.MIN_VALUE + 1) ? -1 - x : x <= 0 ? 1 - x : x;
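As a sanity check that none of the subtractions overflow at the boundaries, here is the 64-bit expression wrapped in a method (names are illustrative) and applied to the extreme inputs:

```java
final class SixtyFourBitMapping {
    static long map(long x) {
        return (x <= Long.MIN_VALUE + 1) ? -1 - x : x <= 0 ? 1 - x : x;
    }

    public static void main(String[] args) {
        long[] inputs = {Long.MIN_VALUE, Long.MIN_VALUE + 1, Long.MIN_VALUE + 2,
                         -1L, 0L, 1L, Long.MAX_VALUE};
        for (long x : inputs) {
            System.out.println(x + " -> " + map(x));
        }
    }
}
```

With this choice of S, the two targets that receive three source values are Long.MAX_VALUE and Long.MAX_VALUE - 1.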

Many other kinds of tuning are possible within this framework.

Gene
  • You and I think an awful lot alike... I already had numbers [-8..7] in a list to play with. :) – ErikE Jul 25 '12 at 23:56
1

I would opt for the most simple, yet not totally time wasting version:

public static long positiveHash(final long hash) {
    final long result = hash & Long.MAX_VALUE;
    return (result != 0) ? result : (hash == 0 ? 1 : 2);
}

This implementation pays one conditional operation for all but two possible inputs: 0 and MIN_VALUE. Those two are assigned different value mappings with the second condition. I doubt you get a better combination of (code) simplicity and (computational) complexity.

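Of course if you can live with a worse distribution, it gets a lot simpler. By restricting the output space to 1/4 of the input range (instead of just under 1/2) you can get:

public static long badDistribution(final long hash) {
    // Clear the sign bit and the lowest bit, then add 1: the result is odd and positive.
    return (hash & (Long.MAX_VALUE - 1)) + 1;
}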
Durandal
1

If the value is non-negative, it can probably be used directly; otherwise, flip the sign bit:

hash = (x >= 0) ? x : x ^ Long.MIN_VALUE;

However, you should scramble this value a bit more if the values of x are correlated (meaning: similar objects produce similar values for x), maybe with

hash = Math.floorMod(a * (hash + b), Long.MAX_VALUE) + 1

for some positive constants a and b, where a should be quite large and b prevents 0 from always being mapped to 1. (Math.floorMod is used rather than % because the product can overflow to a negative value, in which case % would return a negative remainder.) This also maps the whole thing to [1, Long.MAX_VALUE] instead of [0, Long.MAX_VALUE]. By altering the values of a and b you could also implement more complex hashing schemes like cuckoo hashing, which needs two different hash functions.
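A rough sketch of how two such functions could be derived for cuckoo hashing follows. The constants are arbitrary large odd numbers picked for illustration and the names are hypothetical. It uses Math.floorMod (Java 8 and later) so that an overflowing product still yields a non-negative remainder.

```java
final class AffineHashes {
    // Generic form: floorMod(a * (h + b), Long.MAX_VALUE) + 1.
    // floorMod returns a value in [0, Long.MAX_VALUE - 1], so the result
    // is always in [1, Long.MAX_VALUE], even when a * (h + b) overflows.
    static long affineHash(long h, long a, long b) {
        return Math.floorMod(a * (h + b), Long.MAX_VALUE) + 1;
    }

    // Two parameterizations with different (hypothetical) constants.
    static long hash1(long h) {
        return affineHash(h, 6364136223846793005L, 17L);
    }

    static long hash2(long h) {
        return affineHash(h, 2862933555777941757L, 101L);
    }

    public static void main(String[] args) {
        long[] inputs = {0L, 1L, 2L, Long.MAX_VALUE};
        for (long h : inputs) {
            System.out.println(h + " -> " + hash1(h) + " / " + hash2(h));
        }
    }
}
```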

Such a solution should definitely be preferred over one that delivers a "strange collision distribution" for the same values each time it is used.

aRestless
1

You can do it without any conditionals and in a single expression by using the unsigned shift operator:

public static int makePositive(int x) {
  return (x >>> 1) + (~x >>> 31);
}
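Since the question asks about long, here is what a 64-bit analogue might look like, next to the int version. The long variant is an assumption derived by changing the shift from 31 to 63; the class name and makePositiveLong are illustrative.

```java
final class UnsignedShiftHash {
    // int version from above: result is in [1, Integer.MAX_VALUE].
    static int makePositive(int x) {
        return (x >>> 1) + (~x >>> 31);
    }

    // Assumed 64-bit analogue: halve the value as unsigned,
    // then add 1 when x is non-negative (the sign bit of ~x is then set).
    static long makePositiveLong(long x) {
        return (x >>> 1) + (~x >>> 63);
    }

    public static void main(String[] args) {
        long[] inputs = {0L, -1L, Long.MIN_VALUE, Long.MAX_VALUE};
        for (long x : inputs) {
            System.out.println(x + " -> " + makePositiveLong(x));
        }
    }
}
```

In the long variant, the four inputs MAX_VALUE - 1, MAX_VALUE, MIN_VALUE and MIN_VALUE + 1 all map to 2^62.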
Harry Johnston
  • Probably the best way to avoid conditionals. If collapsing 4 values into one is not desirable, this may be preprocessed by `x += x >>> 31`. – Evgeny Kluev Jul 25 '12 at 12:54
0

Just to make sure, you have a long and want to hash it to an int?

You could do...

(int) x                 // This results in a meaningless number, but it works
(int) (x & 0xffffffffL) // This will give you just the low order bits
(int) (x >> 32)         // This will give you just the high order bits
((Long) x).hashCode()   // This is the high and low order bits XORed together

If you want to keep a long you could do...

x & 0x7fffffffffffffffL // This will just ignore the sign, Long.MIN_VALUE -> 0
x & Long.MAX_VALUE      // Same thing, since Long.MAX_VALUE is 0x7fffffffffffffffL

If getting a 0 is no good...

(x & 0x7ffffffffffffffeL) + 1 // This has a 75% collision rate. Parentheses are needed because + binds tighter than &.

Just thinking out loud...

((x & (Long.MAX_VALUE >>> 1)) << 1) + 1 // I think this is also 75%; the mask drops the top two bits so the shift cannot reach the sign bit

I think you're going to need to either be ok with 75% or get a little ugly:

(x > 0) ? x : (x < 0 && x != Long.MIN_VALUE) ? x & Long.MAX_VALUE : 7 // 0 and Long.MIN_VALUE both go to 7
Hounshell
0

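This seems the simplest of all (in Java, use Math.floorMod rather than %, because % returns a negative remainder for negative x):

Math.floorMod(x, Long.MAX_VALUE) + 1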

I would be interested in speed comparisons of all the methods given.

ErikE
0

Just AND your input value with Long.MAX_VALUE and OR it with 1. Nothing else needed.

Ex:

long hash = (input & Long.MAX_VALUE) | 1;
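To see which inputs end up sharing an output, here is a small sketch with an illustrative class name. Because the sign bit is cleared and the lowest bit is forced to 1, the result depends only on bits 1 to 62 of the input, so inputs fall together in groups of four.

```java
final class AndOrHash {
    static long hash(long input) {
        return (input & Long.MAX_VALUE) | 1;
    }

    public static void main(String[] args) {
        long[] inputs = {0L, 1L, Long.MIN_VALUE, Long.MIN_VALUE + 1, 6L, 7L, -1L};
        for (long x : inputs) {
            System.out.println(x + " -> " + hash(x));
        }
    }
}
```

Here 0, 1, MIN_VALUE and MIN_VALUE + 1 all give 1, while -1 gives MAX_VALUE.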
  • Good and simple approach. The only problem is that very similar values such as 0 and 1 (along with MIN_VALUE and MIN_VALUE + 1) are always mapped to the same single value (1). – aRestless Jul 26 '12 at 13:32
  • I think that still more than qualifies for the approximate ~50% collision as stated in the original question, yes? – Roger Johnson Jul 26 '12 at 14:42