2

I am trying to generate non-sequential human readable order codes derived from (lets say) a unsigned 32bit internal id that starts at 1 and is auto incremented for each new order.

In my example code below, will every $hash be unique? (I plan to base34 encode the $hash to make it human readable.)

<?php
function int_hash($key) {
  $key = ($key^0x47cb8a8c) ^ ($key<<12);
  $key = ($key^0x61a988bc) ^ ($key>>19);
  $key = ($key^0x78d2a3c8) ^ ($key<<5);
  $key = ($key^0x5972b1be) ^ ($key<<9);
  $key = ($key^0x2ea72dfe) ^ ($key<<3);
  $key = ($key^0x5ff1057d) ^ ($key>>16);
  return $key;
}

for($order_id = 1; $order_id <= PHP_INT_MAX; ++$order_id) {
  $hash = int_hash($order_id);
}
?>

If not, are there any suggestions on how to replace int_hash?

The result of say, base34 encoding a md5($order_id) is too long for my liking.

brightemo
  • 120
  • 1
  • 9
  • What makes you think that they will be unique? (aka "Why this particular algorithm?"). – Oliver Charlesworth Mar 04 '12 at 00:18
  • I was more hoping it would be. Perhaps I should have just straight out asked how I would go about creating a perfect hash function for a unsigned 32-bit int. I was hoping since they are just numbers that there would be a set of simple mathematical operations I could apply to them which would produce a 1:1 mapping. – brightemo Mar 04 '12 at 00:22
  • All you need is a bit more than 8GiB of space (and of course some time) to answer your question. – Ignacio Vazquez-Abrams Mar 04 '12 at 00:37
  • Possible duplicate of [php short hash](http://stackoverflow.com/questions/959957/php-short-hash) – Leigh Mar 04 '12 at 01:11
  • @Leigh Thank you, a link in that question was quite helpful. I also came across Skip32 (a 32-bit block cipher based on Skipjack) during my own research which would seem to be suitable also. – brightemo Mar 04 '12 at 01:23
  • @brightemo: With all hashes you're going to have a chance of collision though. I don't know of any that guarantee 2^32 unique combinations for 2^32 sequential inputs. I think you'd be better off with some sort of permutation table. – Leigh Mar 04 '12 at 01:28

1 Answers1

18

In my example code below, will every $hash be unique?

Almost. (Which, I guess, means "no, but in a way that's easily fixed".) Your function consists of a sequence of independent steps; the overall function is bijective (reversible) if and only if every single one of those steps is. (Do you see why?)

Now, each step has one of the following forms:

  $key = ($key ^ CONSTANT) ^ ($key >> NUM_BITS);
  $key = ($key ^ CONSTANT) ^ ($key << NUM_BITS);

with NUM_BITS != 0.

We can actually treat these as variants of a single form, by viewing the former as almost equivalent to this:

  $key = invert_order_of_bits($key); # clearly bijective
  $constant = invert_order_of_bits(CONSTANT);
  $key = ($key ^ $constant) ^ ($key << NUM_BITS);
  $key = invert_order_of_bits($key); # clearly bijective

So all we need is to show that this:

  $key = ($key ^ CONSTANT) ^ ($key << NUM_BITS);

is bijective. Now, XOR is commutative and associative, so the above is equivalent to this:

  $key = $key ^ ($key << NUM_BITS);
  $key = $key ^ CONSTANT;

and (x ^ y) ^ y == x ^ (y ^ y) == x ^ 0 == x, so clearly XOR-ing with a constant value is reversible (by re-XOR-ing with the same value); so all we have to show is that this is bijective:

  $key = $key ^ ($key << NUM_BITS);

whenever NUM_BITS != 0.

Now, I'm not writing a rigorous proof, so I'll just give a single reasoned-out example of how to reverse this. Suppose that $key ^ ($key << 9) is

0010 1010 1101 1110 0010 0101 0000 1100

How do we obtain $key? Well, we know that the last nine bits of $key << 9 are all zeroes, so we know that the last nine bits of $key ^ ($key << 9) are the same as the last nine bits of $key. So $key looks like

bbbb bbbb bbbb bbbb bbbb bbb1 0000 1100

so $key << 9 looks like

bbbb bbbb bbbb bb10 0001 1000 0000 0000

so $key looks like

bbbb bbbb bbbb bb00 0011 1101 0000 1100

(by XOR-ing $key ^ ($key << 9) with $key << 9), so $key << 9 looks like

bbbb b000 0111 1010 0001 1000 0000 0000

so $key looks like

bbbb b010 1010 0100 0011 1101 0000 1100

so $key << 9 looks like

0101 1000 0111 1010 0001 1000 0000 0000

so $key looks like

0111 0010 1010 0100 0011 1101 0000 1100

So . . . why do I say "almost" rather than "yes"? Why is your hash-function not perfectly bijective? It's because in PHP, the bitwise shift operators >> and << are not quite symmetric, and while $key = $key ^ ($key << NUM_BITS) is completely reversible, $key = $key ^ ($key >> NUM_BITS) is not. (Above, when I wrote that the two types of steps were "almost equivalent", I really meant that "almost". It makes a difference!) You see, whereas << treats the sign bit just like any other bit, and shifts it out of existence (bringing in a zero-bit on the right), >> treats the sign bit specially, and "extends" it: the bit that it brings in on the left is equal to the sign bit. (N.B. Your question mentions "unsigned 32bit" values, but PHP doesn't actually support that; its bitwise operations are always on signed integers.)

Due to this sign extension, if $key starts with a 0, then $key >> NUM_BITS starts with a 0, and if $key starts with a 1, then $key >> NUM_BITS also starts with a 1. In either case, $key ^ ($key >> NUM_BITS) will start with a 0. You've lost exactly one bit of entropy. If you give me $key ^ ($key >> 9), and don't tell me whether $key is negative, then the best I can do is compute two possible values for $key: one negative, one positive-or-zero.

You perform two steps that use right-shift instead of left-shift, so you lose two bits of entropy. (I'm hand-waving slightly — all I've actually demonstrated is that you lose at least one bit and at most two bits — but I'm confident that, due to the nature of the steps between those right-shift steps, you do actually lose two full bits.) For any given output value, there are exactly four distinct input-values that could yield it. So it's not unique, but it's almost unique; and it's easily fixed, by either:

  • changing the two right-shift steps to use left-shifts instead; or
  • moving both of the right-shift steps to the start of the function, before any left-shift steps, and saying that outputs are unique for inputs between 0 and 231−1 rather than inputs between 0 and 232−1.
ruakh
  • 175,680
  • 26
  • 273
  • 307
  • 1
    An impressive answer indeed. Although I got slightly lost in the 2nd from last paragraph. Might want to slip a line break or two in there :) – Leigh Mar 04 '12 at 02:01
  • Thank you, that was very helpful! My brain keeps telling me that it can not possibly be unique, it is just too simple. (Other solutions I have seen are using some form of encryption, lookup tables, or multiplication by prime numbers, see [link](http://blog.kevburnsjr.com/php-unique-hash)). – brightemo Mar 04 '12 at 02:20
  • @brightemo: You're welcome! I was also really surprised. At first, when I'd just started working through the logic, I completely expected to find a way to generate collisions, rather than to find an argument for why there aren't any. :-P – ruakh Mar 04 '12 at 02:25