4

I am working with a 1-D NumPy array containing thousands of uint64 numbers in Python 2.7. What is the fastest way to calculate the MD5 of every number individually?

Each number has to be converted to string before calling the md5 function. I read in many places that iterating over numpy's arrays and doing stuff in pure python is dead slow. Is there any way to circumvent that?

Adrian W
Frederico Schardong
  • what's the point of this conversion? how md5 string can be used, that the original float64 can not? – lenik Aug 13 '18 at 23:44
  • I just want to convert the uint64 to strings and then get their MD5 as fast as possible. Gonna use those md5 strings later on. – Frederico Schardong Aug 14 '18 at 00:09
  • I'm pretty sure that @lenik is right and that you don't *need* this conversion. Converting before applying the MD5 seems to be an attempt of optimizing a code that is not even yet functional. Would you have a try applying lenik's suggestion? – Tim Dec 02 '19 at 17:07

3 Answers

9

You can write a wrapper for OpenSSL's MD5() function that accepts NumPy arrays. Our baseline will be a pure Python implementation.

Create a builder

# build.py
import cffi

ffi = cffi.FFI()

header = r"""
void md5_array(uint64_t* buffer, int len, unsigned char* out);
"""

source = r"""
#include <stdint.h>
#include <openssl/md5.h>

void md5_array(uint64_t * buffer, int len, unsigned char * out) {
    int i = 0;
    for(i=0; i<len; i++) {
        MD5((const unsigned char *) &buffer[i], 8, out + i*16);
    }
}
"""

# MD5() is defined in OpenSSL's libcrypto, so link against 'crypto'.
ffi.set_source("_md5", source, libraries=['crypto'])
ffi.cdef(header)

if __name__ == "__main__":
    ffi.compile()

and a wrapper

# md5.py
import numpy as np
import _md5

def md5_array(data):
    out = np.zeros(data.shape, dtype='|S16')

    _md5.lib.md5_array(
        _md5.ffi.from_buffer(data),
        data.size,
        _md5.ffi.cast("unsigned char *", _md5.ffi.from_buffer(out))
    )
    return out
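
One caveat with the `|S16` output dtype, sketched here on the assumption that the digests are later read back element by element: NumPy's fixed-width byte-string dtype drops trailing NUL bytes when an element is read, so a digest that ends in `\x00` comes back shorter than 16 bytes. Viewing the same buffer as rows of uint8 keeps every byte. The names `raw`, `as_bytes` and `as_rows` are illustrative only and not part of the wrapper.

```python
import numpy as np

# A fake 16-byte "digest" whose last byte is NUL.
raw = b"\x01" * 15 + b"\x00"

# Reading an element of an S16 array strips trailing NUL bytes.
as_bytes = np.frombuffer(raw, dtype="S16")
truncated = as_bytes[0]
print(len(truncated))

# Viewing the same memory as uint8 rows preserves all 16 bytes.
as_rows = as_bytes.view(np.uint8).reshape(-1, 16)
restored = as_rows[0].tobytes()
print(restored == raw)
```

For a 1-D result from `md5_array`, the same `.view(np.uint8).reshape(-1, 16)` step should give one 16-byte row per input number.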

and compare the two:

# run.py
import numpy as np
import hashlib
import md5

data = np.arange(16, dtype=np.uint64)
out = [hashlib.md5(i).digest() for i in data]
out2 = md5.md5_array(data)

print(data)
# [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
print(out)
# [b'}\xea6+?\xac\x8e\x00\x95jIR\xa3\xd4\xf4t', ... , b'w)\r\xf2^\x84\x11w\xbb\xa1\x94\xc1\x8c8XS']
print(out2)
# [b'}\xea6+?\xac\x8e\x00\x95jIR\xa3\xd4\xf4t', ... , b'w)\r\xf2^\x84\x11w\xbb\xa1\x94\xc1\x8c8XS']

print(all(out == out2))
# True

To compile the bindings and run the script, run

python build.py
python run.py

For large arrays it's about 14x faster (I am a bit disappointed by that, honestly...)

data = np.arange(100000, dtype=np.uint64)

%timeit [hashlib.md5(i).digest() for i in data]
169 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit md5.md5_array(data)
12.1 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Nils Werner
  • AttributeError: module '_md5' has no attribute 'lib', why? – Mowshon Dec 28 '19 at 20:43
  • I am not sure. Maybe your cffi version is very old or you have a `_md5.py` file in that directory. – Nils Werner Dec 28 '19 at 21:01
  • I also experienced the AttributeError: module '_md5' has no attribute 'lib' problem and I'm running `cffi` version `0.14.5` (released after this was posted if I'm not mistaken), so I would say it is not an old version problem... Has anyone managed to fix the 'no lib' problem? Cheers :) – Luca Clissa Dec 02 '21 at 08:58
  • @LucaClissa You probably forgot to build the bindings. I have adjusted the answer to be more explicit about it. – Nils Werner Dec 02 '21 at 10:03
  • I apologize in advance for the (possibly) silly answer: I run build.py and run.py code within the same interactive session and I can see three *_md5*-related files in the folder (`_md5.c`, `_md5.o` and `_md5.cpython-39-x86_64-linux-gnu.so`). Is that what you meant by build the bindings? – Luca Clissa Dec 09 '21 at 16:19
  • Yes. These files are the output of CFFI after running `build.py` – Nils Werner Dec 09 '21 at 19:15
  • This seems to be an excellent solution for a quite common problem, but having to build the `_md5` bindings beforehand is not ideal. I wonder if there is any library out there that already does this at install time – Carles Sala Aug 31 '22 at 16:39
  • [CFFI can do this at install time](https://cffi.readthedocs.io/en/latest/cdef.html), if you set it up correctly. Essentially you'll have to add a `cffi_modules` key to your `setup.py`. – Nils Werner Sep 01 '22 at 10:04
2

I would definitely recommend against converting the uint64 values to strings. You can use struct to get the binary data, which can then be fed to hashlib.md5():

>>> import struct, hashlib
>>> a = struct.pack( '<Q', 0x423423423423 )
>>> a
'#4B#4B\x00\x00'
>>> hashlib.md5( a ).hexdigest()
'de0fc624a1b287881eee581ed83500d1'
>>> 

This will definitely speed up the process, since there is no conversion, just simple byte copies.

Also, hexdigest() may be replaced with digest(), which returns the binary data and is faster than converting it to a hex string. Depending on how you plan to use the data later, this might be a good approach.
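
As a rough sketch of how this could be applied to a whole array (Python 3 syntax, and assuming the 8 little-endian bytes of each number are what should be hashed), the array can be serialized once with `tobytes()` and then sliced, which yields the same bytes as calling `struct.pack('<Q', n)` per element. The helper name `md5_each` is hypothetical.

```python
import hashlib
import struct

import numpy as np


def md5_each(values):
    """Return one 16-byte MD5 digest per uint64, hashing its 8 little-endian bytes."""
    # Serialize the whole array once; each number occupies 8 consecutive bytes.
    raw = np.ascontiguousarray(values, dtype="<u8").tobytes()
    return [hashlib.md5(raw[i:i + 8]).digest() for i in range(0, len(raw), 8)]


data = np.array([0, 1, 0x423423423423], dtype=np.uint64)
digests = md5_each(data)

# Same result as packing every number individually with struct.
reference = [hashlib.md5(struct.pack("<Q", int(n))).digest() for n in data]
print(digests == reference)
```

The loop still runs in Python, so this is mainly a convenient reference to check a compiled version against, not a replacement for one.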

lenik
1

ATTENTION! Sorry, I misread the question. The code below calculates the MD5 of the whole array, without any string conversion, rather than the MD5 of each number. It does not answer the question as asked.

>>> import hashlib
>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4, 5], dtype="uint64")
>>> m = hashlib.md5(arr.astype("uint8"))
>>> m.hexdigest()
'7cfdd07889b3295d6a550914ab35e068'
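
Note that `astype("uint8")` converts the values, so anything above 255 wraps around before hashing and different arrays can end up with the same digest. A minimal sketch of an alternative, assuming the goal is to hash all 8 bytes of every element, is to hash the array's raw bytes instead. The helper name `md5_of_array` is hypothetical.

```python
import hashlib

import numpy as np


def md5_of_array(arr):
    """Hash the array's raw bytes (every byte of every element, in C order)."""
    return hashlib.md5(np.ascontiguousarray(arr).tobytes()).hexdigest()


a = np.array([1, 2, 3], dtype=np.uint64)
b = np.array([257, 2, 3], dtype=np.uint64)  # 257 wraps to 1 when cast to uint8

# Casting to uint8 first makes the two different arrays hash identically.
cast_a = hashlib.md5(a.astype(np.uint8).tobytes()).hexdigest()
cast_b = hashlib.md5(b.astype(np.uint8).tobytes()).hexdigest()
print(cast_a == cast_b)

# Hashing the raw bytes keeps them distinct.
print(md5_of_array(a) == md5_of_array(b))
```

The raw-byte digest depends on dtype and byte order, so both sides of a comparison need to use the same ones.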
SzieberthAdam
  • It looks like this gets the md5 of the whole array, not each element – thomaskeefe Nov 21 '21 at 21:58
  • The question's title was misleading, I have just updated it to reflect the real question. However, this response is exactly what I have been looking for: md5 a numpy array. – Adrian W Jun 17 '22 at 10:12