
I am implementing SpookyHash in one of the applications I am building.

I am referencing the Golang and C libraries. These provide the output in the form of two unsigned 64-bit integers.

When looking at the Python implementation (which is a wrapper around the C++ implementation), it derives a 128-bit number and gives that back as the answer.

My problem is: what is Python doing with the two uint64 values to get this number?

I think this is the relevant C++ code (from the Python wrapper) where it invokes the original C++ library:

static PyObject *
spooky_hash128(PyObject *self, PyObject *args, PyObject *kwargs)
{
    const char *message;
    int message_length;
    uint64 seed[2] = {0};

    static char *kwlist[] = {(char *)"message", (char *)"seed",
        NULL};

    if (!PyArg_ParseTupleAndKeywords(args, kwargs, "s#|K", kwlist,
            &message, &message_length, &seed)) {
        return NULL;
    }

    seed[1] = seed[0];

    SpookyHash::Hash128(message, message_length, &seed[0], &seed[1]);

    PyObject *retval = _PyLong_FromByteArray((unsigned char *)seed, 16, 1, 0);
    return retval;
}

So for a string like

15496-17156-0228-a1c731ea-289b-dcf3-a5d8-afb9b6ba34609-5aba2fe5-54ff-098e-c0eb-457

The two correct 64-bit uints are 12579423875165067478 and 12351582206331609335.

The Python 128-bit integer is: 227846475865583962700201584165695002838

But how is the 128-bit integer derived from the two 64-bit uints? Any pointers to help me understand this would be appreciated.

DMin

3 Answers


It does the arithmetic operations required to obtain the 128-bit number out of the two 64-bit ones:

  • Shift the first (most significant) one 64 bits to the left
  • Add the second one

In other words, it concatenates them.

Example (note that you listed the numbers in reversed order):

>>> ui64_0 = 12579423875165067478
>>> ui64_1 = 12351582206331609335
>>>
>>> ui128_0 = (ui64_1 << 64) + ui64_0
>>> ui128_0
227846475865583962700201584165695002838

This is possible because Python integers are unlimited (or better: limited only by the largest available memory chunk), as [Python 3.Docs]: Numeric Types - int, float, complex states:

Integers have unlimited precision.
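A quick sketch of why this matters here: shifting a value that already occupies all 64 bits another 64 bits to the left would overflow a fixed-width uint64 in C or Go, while Python simply grows the integer:

>>> ui64_1 = 12351582206331609335
>>> ui64_1.bit_length()             # already occupies the full 64 bits
64
>>> (ui64_1 << 64).bit_length()     # no overflow; Python grows the int as needed
128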

CristiFati

The code uses an unsupported function from the Python C-API to take an arbitrary unsigned char array and turn it into an integer. From the definition of _PyLong_FromByteArray() you can see why the calling code includes a cast from uint64[] to char[]:

PyObject *
_PyLong_FromByteArray(const unsigned char* bytes, size_t n,
                      int little_endian, int is_signed)

So instead of taking two 64-bit numbers, it is passed 16 8-bit numbers, which is what the (unsigned char *) cast is for. The call passes in 16 for n, and little_endian is set to 1 and is_signed to 0.
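As a rough sketch of the byte layout the cast exposes (using the standard struct module, which is not part of the wrapper itself), packing the two uint64 values little-endian gives the same 16-byte buffer:

>>> import struct
>>> # '<QQ': two unsigned 64-bit values, little-endian, laid out back to back
>>> raw = struct.pack('<QQ', 12579423875165067478, 12351582206331609335)
>>> len(raw)
16
>>> raw
b'\xd6\x18H\xa6]\x17\x93\xae\xf7`n>\x93\xa2i\xab'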

In Python code, you can do the same with the int.to_bytes() method; convert both to bytes of length 8, little-endian (as the SpookyHash C++ reference implementation is explicitly designed for 64-bit little-endian architectures):

>>> bytevalue = (12579423875165067478).to_bytes(8, 'little') + (12351582206331609335).to_bytes(8, 'little')
>>> bytevalue
b'\xd6\x18H\xa6]\x17\x93\xae\xf7`n>\x93\xa2i\xab'
>>> list(bytevalue)
[214, 24, 72, 166, 93, 23, 147, 174, 247, 96, 110, 62, 147, 162, 105, 171]

Each byte is a component of the final number as a multiple of a power of 256. The least significant byte is multiplied by 256 ** 0, the next by 256 ** 1, and so on. In a little-endian layout, the least significant byte comes first (the 256 ** 0 value), so in the output above, the 171 at the right is the most significant byte, contributing 171 times 256 ** 15.

You can re-create the number in Python code by doing this yourself:

value = 0
for i, b in enumerate(bytevalue):
    value += b * (256 ** i)

which produces the expected output:

>>> bytevalue = (12579423875165067478).to_bytes(8, 'little') + (12351582206331609335).to_bytes(8, 'little')
>>> value = 0
>>> for i, b in enumerate(bytevalue):
...     value += b * (256 ** i)
...
>>> value
227846475865583962700201584165695002838

except CPUs use bit-shifting to achieve this; shifting a value 8 bits to the left is the same thing as multiplying it by 256, and repeated applications of such shifts would multiply the value by a power of 256. If you started at the most-significant byte and kept shifting the value-so-far to the left by 8 bits before including the next byte (using bit-wise OR), you'd get the same output:

>>> value = 0
>>> for b in reversed(bytevalue):
...     value = value << 8 | b
...
>>> value
227846475865583962700201584165695002838

To avoid reversing, you could shift the current byte by the number of bits already accumulated before combining:

>>> value = accumbits = 0
>>> for b in bytevalue:
...     value |= (b << accumbits)
...     accumbits += 8
...
>>> value
227846475865583962700201584165695002838

This is what the _PyLong_FromByteArray implementation actually uses. However, the internal structure of a Python int splits large integers into multiple 30-bit or 15-bit 'digits' so that arbitrarily large integer values can fit into fixed-size C integers, which is why the function also does some additional testing and shifting with PyLong_SHIFT.
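As a rough illustration of that internal layout (assuming the common 30-bit digit build; PYLONG_SHIFT below is just a local variable, not the C macro), you can split the 128-bit result into 30-bit chunks yourself:

>>> value = 227846475865583962700201584165695002838
>>> PYLONG_SHIFT = 30                 # assuming 30-bit digits
>>> digits = []
>>> while value:
...     digits.append(value & ((1 << PYLONG_SHIFT) - 1))   # lowest 30 bits become one digit
...     value >>= PYLONG_SHIFT
...
>>> len(digits)      # 128 bits need 5 digits of 30 bits each
5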

All this comes down to the two 64-bit input values being placed end-to-end in memory to form a long 128-bit number; the first number (being least significant) sits to the right of the second number (being more significant), so in Python code you can just shift the second number 64 bits to the left and combine the result with the first:

>>> 12579423875165067478 | 12351582206331609335 << 64
227846475865583962700201584165695002838
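Equivalently, if you already have the 16 raw bytes, int.from_bytes() is the pure-Python counterpart of _PyLong_FromByteArray, using the same byte-order and signedness choices as the C call above:

>>> bytevalue = (12579423875165067478).to_bytes(8, 'little') + (12351582206331609335).to_bytes(8, 'little')
>>> int.from_bytes(bytevalue, 'little', signed=False)
227846475865583962700201584165695002838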
Martijn Pieters

Convert those numbers to hex, and you will see the connection:

12579423875165067478 = AE93175DA64818D6h
12351582206331609335 = AB69A2933E6E60F7h

227846475865583962700201584165695002838 = AB69A2933E6E60F7AE93175DA64818D6h

Let's see this in more detail:

227846475865583962700201584165695002838 = AB69A2933E6E60F7 AE93175DA64818D6h

That 128-bit number is just the two 64-bit values concatenated.
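A quick way to check this in Python is hex():

>>> hex(12579423875165067478)
'0xae93175da64818d6'
>>> hex(12351582206331609335)
'0xab69a2933e6e60f7'
>>> hex(227846475865583962700201584165695002838)
'0xab69a2933e6e60f7ae93175da64818d6'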

Jeroen Jacobs