In Python/NumPy, I have a 10,000x10,000 array named random_matrix. I use md5 to compute the hash of str(random_matrix) and of random_matrix itself. The string version takes 0.00754404067993 seconds and the NumPy array version takes 1.6968960762 seconds. With a 20,000x20,000 array, the string version takes 0.0778470039368 seconds and the array version takes 60.641119957 seconds. Why is this? Do NumPy arrays take up much more memory than strings? Also, if I want to build filenames identified by these matrices, is converting to a string before computing the hash a good idea, or are there drawbacks?
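For reference, a minimal sketch of the timing setup described above (the array size and variable names are assumptions; hashing the array directly works because a C-contiguous NumPy array exposes the buffer protocol):

```python
import hashlib
import time

import numpy as np

# Smaller than the 10,000x10,000 array in the question, to keep the demo fast.
random_matrix = np.random.rand(1000, 1000)

start = time.time()
str_digest = hashlib.md5(str(random_matrix).encode()).hexdigest()
print("str version:   %.6f s" % (time.time() - start))

start = time.time()
# md5 reads the array's raw buffer directly (C-contiguous arrays only).
arr_digest = hashlib.md5(random_matrix).hexdigest()
print("array version: %.6f s" % (time.time() - start))
```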

mlstudent
1 Answer
str(random_matrix) will not include all of the matrix, because NumPy elides the interior of large arrays with "...":
>>> x = np.ones((1000, 1000))
>>> print str(x)
[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ...,
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]
So when you hash str(random_matrix), you aren't really hashing all the data.
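To see why that matters for using the hash as a file identifier, here is a sketch: two arrays that differ only in an element hidden inside the elided "..." region produce the same string, and therefore the same md5 digest:

```python
import hashlib
import numpy as np

a = np.ones((1000, 1000))
b = np.ones((1000, 1000))
b[500, 500] = 0.0  # change a value inside the elided "..." region

# Both strings show only the outer rows/columns, so they are identical.
print(str(a) == str(b))
# Consequently the md5 digests of the strings collide, even though a != b.
print(hashlib.md5(str(a).encode()).hexdigest() ==
      hashlib.md5(str(b).encode()).hexdigest())
```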
See this previous question and this one about how to hash numpy arrays.
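Along the lines of those questions, a minimal sketch of hashing the full array data instead (the helper name array_md5 is my own; tobytes serializes the raw buffer, and ascontiguousarray guards against non-contiguous views such as transposes):

```python
import hashlib

import numpy as np

def array_md5(arr):
    # Ensure a C-contiguous buffer, then hash the raw bytes.
    arr = np.ascontiguousarray(arr)
    return hashlib.md5(arr.tobytes()).hexdigest()

x = np.random.rand(100, 100)
print(array_md5(x))
```

Note that the digest depends on the dtype and byte order as well as the values, so the same numbers stored as int64 and float64 hash differently.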