
I am creating a system that stores large numpy arrays in pyarrow.plasma. I want to give each array a unique, deterministic plasma.ObjectID; sadly, np.array is not hashable. My current (broken) approach is:

import numpy as np
from pyarrow import plasma

def int_to_bytes(x: int) -> bytes:
    # https://stackoverflow.com/questions/21017698/converting-int-to-bytes-in-python-3
    return x.to_bytes((x.bit_length() + 7) // 8, "big")

def get_object_id(arr):
    arr_id = int(arr.sum() / arr.shape[0])
    # pad on the left with ASCII '0' bytes; an ObjectID must be exactly 20 bytes long
    oid: bytes = int_to_bytes(arr_id).zfill(20)
    return plasma.ObjectID(oid)

But this can easily fail, for example:

arr = np.arange(12)
a1 = arr.reshape(3, 4)
a2 = arr.reshape(3,2,2)
assert get_object_id(a1) != get_object_id(a2), 'Hash collision'
# another good test case
assert get_object_id(np.ones(12)) != get_object_id(np.ones(12).reshape(4,3))
assert get_object_id(np.ones(12)) != get_object_id(np.zeros(12))

It also involves summing the array, which could be very slow for large arrays. Feel free to assume that the dtype of arr will be np.uint or np.int.

I now think that it's impossible to never have a hash collision (I only have 20 bytes of ID, i.e. 2^160 possible IDs, and there are more possible inputs than that), so I am just looking for something that is either a) cheaper to compute or b) less likely to fail in practice

or, ideally, both!

Sam Shleifer

1 Answer


The hashlib module has routines for computing cryptographic hashes from byte strings. You can convert an ndarray into a byte string with ndarray.tobytes(); however, your examples will still fail, because those arrays have the same bytes but different shapes. So you could just hash the shape as well.

import hashlib

def hasharr(arr):
    # Hash the array's raw bytes, then mix in each shape dimension so that
    # arrays with the same data but different shapes get different digests.
    h = hashlib.blake2b(arr.tobytes(), digest_size=20)  # 20-byte digest matches the ObjectID length
    for dim in arr.shape:
        h.update(dim.to_bytes(4, byteorder='big'))
    return h.digest()

Example:

>>> hasharr(a1)
b'\x9f\xd7<\x16\xb6u\xfdM\x14\xc2\xe49.\xf0P\xaa[\xe9\x0bZ'
>>> hasharr(a2)
b"Z\x18+'`\x83\xd6\xc8\x04\xd4%\xdc\x16V)\xb3\x97\x95\xf7v"

I'm not an expert on blake2b, so you'd have to do your own research to figure out how likely a collision would be.

I'm not sure why you tagged pyarrow, but if you want to do the same on pyarrow arrays without converting to numpy, you can get the buffers of an array with arr.buffers() and convert those buffers (there will be several, and some may be None) to byte strings with buf.to_pybytes(). Just hash all the buffers. There is no need to worry about the shape here, because pyarrow arrays are always one-dimensional.
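A sketch of what that could look like (the helper name hash_pyarrow_array is just for illustration):

import hashlib
import pyarrow as pa

def hash_pyarrow_array(arr: pa.Array) -> bytes:
    # Hash every buffer backing the array, skipping absent (None) buffers.
    # No shape handling is needed: pyarrow arrays are one-dimensional.
    h = hashlib.blake2b(digest_size=20)
    for buf in arr.buffers():
        if buf is not None:
            h.update(buf.to_pybytes())
    return h.digest()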

Pace
  • This is really close to working! The only issue with it is that if a1.shape has an entry larger than 256, `bytes(a1.shape)` raises `ValueError: bytes must be in range(0, 256)` – Sam Shleifer Feb 26 '21 at 16:07
  • 1
    Thanks for pointing that out. I've updated the alg. It allocates 4 bytes per dimension which should be good for most cases I think. You could always make it dynamic to use as few as neccesary as a bonus exercise. – Pace Feb 26 '21 at 17:50
  • 1
    Cool I ended up just hashing the shape of the array and `arr[0].to_bytes()` to make it faster to compute the hash, but this was massively helpful! – Sam Shleifer Feb 27 '21 at 18:33
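A rough sketch of the cheaper variant described in that last comment (the helper name fast_object_id_bytes is hypothetical; it hashes only the shape and the first flattened element via the numpy scalar's tobytes(), trading speed for a higher collision risk, since arrays agreeing on shape and first element get the same ID):

import hashlib
import numpy as np

def fast_object_id_bytes(arr: np.ndarray) -> bytes:
    # Hash only the shape and the first element instead of the full buffer.
    h = hashlib.blake2b(digest_size=20)
    for dim in arr.shape:
        h.update(dim.to_bytes(4, byteorder='big'))
    h.update(arr.flat[0].tobytes())  # first element as raw bytes
    return h.digest()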