
I am wondering why hashing strings stored in a NumPy object array (`dtype=object`) produces the results I would expect from hashing by value:

>>> hashlib.sha256(np.array(['asdfda'], dtype=object)).hexdigest()
'6cc08fd2542235fe8097c017c20b85350899c81616db8cb59045022663e3cee1'
>>> hashlib.sha256(np.array(['asd'+'fda'], dtype=object)).hexdigest()
'6cc08fd2542235fe8097c017c20b85350899c81616db8cb59045022663e3cee1'

That is, the hashing takes into account the actual object value, not just the pointer value stored in the array. (Those strings would definitely have different pointers.)

hashlib methods seem to accept objects supporting some 'buffer API', since passing anything else raises TypeError: object supporting the buffer API required.

Does that mean that the buffer API implementation for numpy's ndarray does not return an array of pointers but, somehow, an array of strings? In other words, how does hashlib.hash_algorithm get to the stored character data?

Adam
  • I'd suggest that [this](https://docs.python.org/3/c-api/buffer.html) is the "some 'buffer API'" that it's required to support. It doesn't really answer what it's doing with the PyObjects but it gives you a clue about where to start. – DavidW Nov 25 '20 at 16:47

1 Answer


> Those strings would definitely have different pointers.

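"Definitely" is a pretty strong claim here. CPython folds 'asd'+'fda' into a single constant at compile time and interns short identifier-like string literals, so both names can end up bound to the same object. Look at what I see when testing that out in a REPL: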

>>> s = 'asdfda'
>>> s2 = 'asd'+'fda'
>>> s is s2
True

However, a string built at runtime is a distinct object:

>>> s3 = s[:2] + s[2:]
>>> s is s3
False
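
To make the distinction between equality and identity explicit, here is a minimal sketch using only the standard library. It assumes CPython; `parts`, `a` and `b` are names introduced purely for illustration. Whether two equal strings happen to share an object is an implementation detail, but `sys.intern` is documented to return one canonical copy per value.

```python
import sys

# The literal is a compile-time constant; the joined string is assembled
# at runtime, so the compiler cannot fold it into the same constant.
s = 'asdfda'
parts = ['asd', 'fda']
s3 = ''.join(parts)

print(s == s3)   # compares the characters
print(s is s3)   # compares object identity; do not rely on the result

# sys.intern returns the canonical copy for a given value, so after
# interning, equal strings are also the same object.
a = sys.intern(s)
b = sys.intern(s3)
print(a is b)
```

If whatever produced your strings interned them along the way, equal values will share a pointer, which is the only reason pointer-based hashing can appear to work.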

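And, just as expected, the hash is different, because the digest is computed over the pointers stored in the array rather than over the characters they point to: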

>>> hashlib.sha256(np.array([s],dtype=object)).hexdigest()
'176c63097ace4b6754acdd8e37b861bbe1e33489f52d6bd8df07983ead23c73e'
>>> hashlib.sha256(np.array([s3],dtype=object)).hexdigest()
'478307a1bfb4bf413c7e538cc4bbe02370072b0968a91155a4a838e68477f62e'
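
If what you actually want is a digest of the string contents, one option is to feed the encoded characters to hashlib yourself. The following is only a sketch under stated assumptions: `hash_strings` is a hypothetical helper, not part of hashlib or numpy, and the UTF-8 encoding and 8-byte length prefix are my own choices. It takes any iterable of strings, so an object array should work as well as a plain list.

```python
import hashlib

def hash_strings(values):
    """Hypothetical helper: SHA-256 over string contents, not object addresses."""
    h = hashlib.sha256()
    for v in values:
        # str() is a no-op for strings; other objects are hashed by their text form.
        data = str(v).encode('utf-8')
        # The length prefix keeps ['ab', 'c'] distinct from ['a', 'bc'].
        h.update(len(data).to_bytes(8, 'little'))
        h.update(data)
    return h.hexdigest()

s = 'asdfda'
s3 = ''.join(['asd', 'fda'])  # equal value, built at runtime

# Equal contents give equal digests, whether or not the objects are shared.
print(hash_strings([s]) == hash_strings([s3]))
```

This way the result no longer depends on whether the interpreter or a parsing library happened to intern the strings.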
juanpa.arrivillaga
  • Oh no. I didn't realize string interning was so powerful: https://stackoverflow.com/a/24245514/3791837 – Adam Nov 25 '20 at 18:30
  • I believed it because, in my case, it worked :D, i.e. I got the same hash value. The case: my pandas Series somehow held the same pointer for every occurrence of a given string, even though it was read from a csv, the strings were on different lines, and I did _not_ specify the Series to be of type 'category'. See more at: https://stackoverflow.com/q/65012145/3791837 – Adam Nov 25 '20 at 20:24
  • @Adam probably `pandas.read_csv` interns the strings while it parses the csv. That isn't entirely unusual; for example, `json.load` does the same thing – juanpa.arrivillaga Nov 25 '20 at 20:30