I am creating a system that stores large numpy arrays in pyarrow.plasma. I want to give each array a unique, deterministic plasma.ObjectID; sadly, np.array is not hashable.
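For example, trying to hash an array directly fails:

    >>> import numpy as np
    >>> hash(np.arange(3))
    TypeError: unhashable type: 'numpy.ndarray'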
My current (broken) approach is:
    import numpy as np
    from pyarrow import plasma


    def int_to_bytes(x: int) -> bytes:
        # https://stackoverflow.com/questions/21017698/converting-int-to-bytes-in-python-3
        return x.to_bytes((x.bit_length() + 7) // 8, "big")


    def get_object_id(arr):
        arr_id = int(arr.sum() / arr.shape[0])
        # pad on the left with b"0" so the ID is exactly 20 bytes long
        oid: bytes = int_to_bytes(arr_id).zfill(20)
        return plasma.ObjectID(oid)
But this can easily fail. For example, a1 and a2 below share the same sum (66) and the same first dimension (3), so both map to int(66 / 3) == 22:
    arr = np.arange(12)
    a1 = arr.reshape(3, 4)
    a2 = arr.reshape(3, 2, 2)
    assert get_object_id(a1) != get_object_id(a2), 'Hash collision'

    # two more good test cases
    assert get_object_id(np.ones(12)) != get_object_id(np.ones(12).reshape(4, 3))
    assert get_object_id(np.ones(12)) != get_object_id(np.zeros(12))
It also involves summing the array, which could be very slow for large arrays.
Feel free to assume that the dtype of arr will be np.uint or np.int.
I now think that it's impossible to never have a hash collision: the ID is only 20 bytes, so there are 2^160 possible IDs and more than 2^160 possible inputs. So I am just looking for something that is either a) cheaper to compute, or b) less likely to fail in practice, or, ideally, both!
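For reference, one direction I have been toying with (just a sketch, not benchmarked; get_object_id_sha1 is my own placeholder name): hashlib.sha1 produces a digest of exactly 20 bytes, which happens to match ObjectID's required length, and it can consume the array's raw buffer directly. Mixing in the shape should separate reshaped views of the same data:

    import hashlib

    import numpy as np
    from pyarrow import plasma


    def get_object_id_sha1(arr: np.ndarray) -> plasma.ObjectID:
        # np.ascontiguousarray is a no-op for arrays that are already
        # C-contiguous, so the common case avoids a copy; hashlib accepts
        # any object supporting the buffer protocol.
        h = hashlib.sha1(np.ascontiguousarray(arr))
        # Mix in the shape so reshaped views of the same buffer
        # (e.g. (3, 4) vs (3, 2, 2)) get different IDs.
        h.update(str(arr.shape).encode())
        # A SHA-1 digest is exactly 20 bytes, as ObjectID requires.
        return plasma.ObjectID(h.digest())

All three asserts above pass with this version (barring an astronomically unlikely SHA-1 collision), but it still reads the whole array once, so it only addresses b), not a).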