
As part of a work project I am porting a Perl library to Python. I'm comfortable with Python, much (much) less so with Perl.

The Perl code uses Digest::MD5. This module exports three functions:

  • md5($data) takes in data and spits out md5 digest in binary
  • md5_hex($data) takes in data and spits out md5 digest in hex
  • md5_base64($data) takes in data and spits out md5 digest in base64 encoding

I can replicate md5_hex with something like this:

import hashlib
string = 'abcdefg'
print(hashlib.md5(string.encode()).hexdigest())

This works fine (the same inputs give the same outputs, at least), but I can't seem to get anything to match for the other two functions.

It doesn't help that string encodings are really not something I've done much with. I've been interpreting the Perl functions as taking an MD5 digest and then re-encoding it in binary or base64, something like this:

import hashlib
import base64
string = 'abcdefg'
md5_string = hashlib.md5(string.encode()).hexdigest()
print(base64.b64encode(md5_string))

but maybe that's wrong? I'm sure there's something fundamental I'm just missing.

The Perl doc is here: https://metacpan.org/pod/Digest::MD5

Sinan Ünür
user1781837
  • Clearly, you don't want to Base64 encode the hex representation of the hash, but the hash itself. – Sinan Ünür Oct 17 '16 at 20:08
  • If you want to calculate the MD5 of big files, consider this solution: http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python/40961519#40961519 – Laurent LAPORTE Dec 04 '16 at 18:00

2 Answers

3

The first one, md5, corresponds to simply calling the .digest() method on the md5 object:

>>> from hashlib import md5
>>> s = 'abcdefg'
>>> md5(s.encode()).digest()
b'z\xc6l\x0f\x14\x8d\xe9Q\x9b\x8b\xd2d1,Md'

And md5_base64 is the digest but base64-encoded:

>>> import base64
>>> base64.b64encode(md5(s.encode()).digest())
b'esZsDxSN6VGbi9JkMSxNZA=='

However, Perl's md5_base64 returns the result without padding, so to be compatible you'd strip the trailing = padding characters:

>>> base64.b64encode(md5(s.encode()).digest()).rstrip(b'=')
b'esZsDxSN6VGbi9JkMSxNZA'
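
Putting the pieces above together, a minimal sketch of Python counterparts to the three Perl functions might look like the following. The function names simply mirror the Perl ones and are not part of hashlib, and the _to_bytes helper is hypothetical. The sketch assumes that str input should be hashed as UTF-8; Perl's functions operate on bytes, so the right encoding for your data may differ.

```python
import base64
import hashlib


def _to_bytes(data):
    # Assumption: text is hashed as UTF-8; pass bytes to control the encoding yourself.
    return data.encode("utf-8") if isinstance(data, str) else data


def md5(data):
    """Raw 16-byte digest, like Perl's md5($data)."""
    return hashlib.md5(_to_bytes(data)).digest()


def md5_hex(data):
    """Lowercase hex digest, like Perl's md5_hex($data)."""
    return hashlib.md5(_to_bytes(data)).hexdigest()


def md5_base64(data):
    """Base64 of the raw digest with the '=' padding removed, like Perl's md5_base64($data)."""
    return base64.b64encode(md5(data)).rstrip(b"=").decode("ascii")


print(md5_hex("abcdefg"))
print(md5_base64("abcdefg"))
```

Returning str from md5_hex and md5_base64 (rather than bytes) is a design choice here, made so the results compare directly against text output from a Perl script.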
0

First, note what the Digest::MD5 documentation says:

Note that the base64 encoded string returned is not padded to be a multiple of 4 bytes long. If you want interoperability with other base64 encoded md5 digests you might want to append the redundant string "==" to the result.
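
Going the other way, Python's base64.b64decode expects padded input by default, so a value produced by Perl's md5_base64 needs its padding restored before it can be decoded. A minimal sketch, where pad_base64 is a hypothetical helper name rather than a library function:

```python
import base64
import hashlib


def pad_base64(unpadded):
    # Append '=' until the length is a multiple of 4.
    return unpadded + "=" * (-len(unpadded) % 4)


# Perl's md5_base64 output for 'abcdefg', as reported in this thread.
perl_style = "esZsDxSN6VGbi9JkMSxNZA"
padded = pad_base64(perl_style)
raw = base64.b64decode(padded)

print(padded)
print(raw == hashlib.md5(b"abcdefg").digest())
```

For an MD5 digest the missing padding is always "==", as the documentation says; computing it from the length just keeps the helper usable for other inputs.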

Second, note that you want to Base64 encode the hash, not the hex representation of it:

print(base64.b64encode(hashlib.md5(string.encode()).digest()).decode())

esZsDxSN6VGbi9JkMSxNZA==

perl -MDigest::MD5=md5_base64 -E 'say md5_base64($ARGV[0])' abcdefg

esZsDxSN6VGbi9JkMSxNZA
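
To see why the attempt in the question produced something different, it can help to compare what actually gets Base64-encoded in each case. A small sketch using only hashlib and base64 (the variable names are just for illustration):

```python
import base64
import hashlib

h = hashlib.md5("abcdefg".encode())

hex_text = h.hexdigest()  # 32 ASCII characters describing the digest
raw = h.digest()          # the 16 digest bytes themselves

# Encoding the hex text treats those 32 characters as the data.
from_hex = base64.b64encode(hex_text.encode("ascii")).decode("ascii")
# Encoding the raw digest is what Perl's md5_base64 does (minus padding).
from_raw = base64.b64encode(raw).decode("ascii")

print(len(hex_text), len(from_hex))
print(len(raw), len(from_raw))
print(from_raw.rstrip("="))
```

Only the second value, with its padding stripped, matches what Perl's md5_base64 prints.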

Sinan Ünür
  • I chose Antti Haapala's answer because he also covered the binary case but this is very good for the base64 part. Thank you! – user1781837 Oct 17 '16 at 20:21