How exactly do the hashlib hashers treat input?

Question

The Python 2.7 documentation has this to say about the hashlib hashers:

hash.update(arg)

    Update the hash object with the string arg. [...]

But I have seen people feed it objects that are not strings, e.g. buffers, numpy ndarrays.

Given Python's duck typing, I'm not surprised that it is possible to specify non-string arguments.

The question is: how do I know the hasher is doing the right thing with the argument?

I can't imagine the hasher naïvely doing a shallow iteration on the argument, because that would probably fail miserably for ndarrays with more than one dimension: a shallow iteration over an n-dimensional ndarray yields ndarrays with n-1 dimensions.

score 2 · Accepted Answer · answered May 15 '15 at 09:55


update unpacks its argument using the s# format spec. This means the argument can be a string, a Unicode object, or any object that exposes the buffer interface.

You can't implement the buffer interface in pure Python, but C libraries like numpy can and do, which allows their objects to be passed to hash.update.

Multi-dimensional arrays work fine: at the C level they're stored as a contiguous series of bytes.
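As a rough illustration of the point above, here is a minimal sketch that uses only the standard library. It assumes Python 3, where the buffer interface is usually described as "bytes-like objects", and uses array.array as a stand-in for a numpy ndarray; the variable names are made up for the example. It suggests that the hasher reads the raw bytes behind the object rather than iterating over its elements.

```python
import array
import hashlib

# array.array exposes its storage through the buffer interface,
# much like a numpy ndarray does (used here as a stdlib stand-in).
values = array.array("d", [1.0, 2.0, 3.0, 4.0])

# Pass the object itself: the hasher reads its underlying bytes.
direct = hashlib.sha256()
direct.update(values)
direct_digest = direct.hexdigest()

# Hash an explicit copy of the same raw bytes.
raw_digest = hashlib.sha256(values.tobytes()).hexdigest()

# Hash through a memoryview, which shares the buffer without copying.
view_digest = hashlib.sha256(memoryview(values)).hexdigest()

# An object without a buffer interface is rejected rather than iterated.
try:
    hashlib.sha256().update([1.0, 2.0, 3.0, 4.0])
    list_rejected = False
except TypeError:
    list_rejected = True

print(direct_digest == raw_digest == view_digest)
print(list_rejected)
```

The three digests should match because all three calls hand the hasher the same bytes; the plain list is refused because it has no buffer to read.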


orlp


Multi-dimensional arrays from numpy may "work", but whether that counts as "fine" really depends on the guarantees your application and libraries like numpy make about the exact memory layout of the data they present via the buffer interface. Will it always be the same? Is it ordered by rows or by columns? In what format are the bytes of an array of larger values stored? Ensure these are always the same before you hash something other than an obvious linear sequence of bytes. – gps May 16 '15 at 07:10
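To make the layout concern concrete, here is a small sketch, again assuming Python 3 and using only the standard library. The helper name digest and the sample matrix are hypothetical; struct.pack stands in for whatever library serialises the values. The same six integers are hashed in row order, in column order, and with a different byte order.

```python
import hashlib
import struct

# A 2x3 matrix of small integers (illustrative data only).
matrix = [[1, 2, 3], [4, 5, 6]]


def digest(values, byte_order):
    """Pack values as 32-bit ints and hash the resulting bytes.

    byte_order is a struct prefix: "<" for little-endian, ">" for big-endian.
    """
    packed = struct.pack(f"{byte_order}{len(values)}i", *values)
    return hashlib.sha256(packed).hexdigest()


# Row-major order: 1, 2, 3, 4, 5, 6
row_major = [x for row in matrix for x in row]
# Column-major order: 1, 4, 2, 5, 3, 6
col_major = [row[c] for c in range(3) for row in matrix]

row_le = digest(row_major, "<")
col_le = digest(col_major, "<")
row_be = digest(row_major, ">")

print(row_le == col_le)  # same values, different element order
print(row_le == row_be)  # same values, different byte order
```

Both comparisons should come out unequal: the hasher only ever sees bytes, so any change in element order or byte order is a different input as far as it is concerned.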
