In Python/NumPy, I have a 10,000x10,000 array named random_matrix. I use md5 to compute the hash of str(random_matrix) and of random_matrix itself. The string version takes 0.00754404067993 seconds and the NumPy array version takes 1.6968960762 seconds. With a 20,000x20,000 array, the string version takes 0.0778470039368 seconds and the array version takes 60.641119957 seconds. Why is this? Do NumPy arrays take up much more memory than strings? Also, if I want to build filenames identified by these matrices, is converting to a string before computing the hash a good idea, or are there drawbacks?
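For reference, a minimal sketch of the timing setup described above (the array size and variable names are assumptions; hashing the array directly works because a C-contiguous NumPy array exposes the buffer protocol):

```python
import hashlib
import time

import numpy as np

# Smaller than the 10,000x10,000 array in the question, to keep the demo fast.
random_matrix = np.random.rand(1000, 1000)

start = time.time()
str_digest = hashlib.md5(str(random_matrix).encode()).hexdigest()
print("str version:   %.6f s" % (time.time() - start))

start = time.time()
# md5 reads the array's raw buffer directly (C-contiguous arrays only).
arr_digest = hashlib.md5(random_matrix).hexdigest()
print("array version: %.6f s" % (time.time() - start))
```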

mlstudent
1 Answer
str(random_matrix) will not include all of the matrix, because NumPy elides the interior of large arrays with "...":
>>> x = np.ones((1000, 1000))
>>> print str(x)
[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ...,
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]
So when you hash str(random_matrix), you aren't really hashing all the data.
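To see why that matters for using the hash as a file identifier, here is a sketch: two arrays that differ only in an element hidden inside the elided "..." region produce the same string, and therefore the same md5 digest:

```python
import hashlib
import numpy as np

a = np.ones((1000, 1000))
b = np.ones((1000, 1000))
b[500, 500] = 0.0  # change a value inside the elided "..." region

# Both strings show only the outer rows/columns, so they are identical.
print(str(a) == str(b))
# Consequently the md5 digests of the strings collide, even though a != b.
print(hashlib.md5(str(a).encode()).hexdigest() ==
      hashlib.md5(str(b).encode()).hexdigest())
```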
See this previous question and this one about how to hash numpy arrays.
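Along the lines of those questions, a minimal sketch of hashing the full array data instead (the helper name array_md5 is my own; tobytes serializes the raw buffer, and ascontiguousarray guards against non-contiguous views such as transposes):

```python
import hashlib

import numpy as np

def array_md5(arr):
    # Ensure a C-contiguous buffer, then hash the raw bytes.
    arr = np.ascontiguousarray(arr)
    return hashlib.md5(arr.tobytes()).hexdigest()

x = np.random.rand(100, 100)
print(array_md5(x))
```

Note that the digest depends on the dtype and byte order as well as the values, so the same numbers stored as int64 and float64 hash differently.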