8

Good morning, All.

I want to convert my social security numbers to MD5 hex digests. The outcome should be a distinct MD5 hex digest for each distinct social security number.

My data format is as follows:

ob = onboard[['regions','lname','ssno']][:10]
ob

   regions               lname     ssno
0  Northern Region (R1)  Banderas  123456789
1  Northern Region (R1)  Garfield  234567891
2  Northern Region (R1)  Pacino    345678912
3  Northern Region (R1)  Baldwin   456789123
4  Northern Region (R1)  Brody     567891234
5  Northern Region (R1)  Johnson   6789123456
6  Northern Region (R1)  Guinness  7890123456
7  Northern Region (R1)  Hopkins   891234567
8  Northern Region (R1)  Paul      891234567
9  Northern Region (R1)  Arkin     987654321

I've tried the following code using hashlib:

import hashlib

ob['md5'] = hashlib.md5(['ssno'])

This gave me the error that it had to be a string not a list. So I tried the following:

ob['md5'] = hashlib.md5('ssno').hexdigest()



   regions               lname     ssno       md5
0  Northern Region (R1)  Banderas  123456789  a1b3ec3d8a026d392ad551701ad7881e
1  Northern Region (R1)  Garfield  234567891  a1b3ec3d8a026d392ad551701ad7881e
2  Northern Region (R1)  Pacino    345678912  a1b3ec3d8a026d392ad551701ad7881e
3  Northern Region (R1)  Baldwin   456789123  a1b3ec3d8a026d392ad551701ad7881e
4  Northern Region (R1)  Brody     567891234  a1b3ec3d8a026d392ad551701ad7881e
5  Northern Region (R1)  Johnson   678912345  a1b3ec3d8a026d392ad551701ad7881e
6  Northern Region (R1)  Johnson   789123456  a1b3ec3d8a026d392ad551701ad7881e
7  Northern Region (R1)  Guiness   891234567  a1b3ec3d8a026d392ad551701ad7881e
8  Northern Region (R1)  Hopkins   912345678  a1b3ec3d8a026d392ad551701ad7881e
9  Northern Region (R1)  Paul      159753456  a1b3ec3d8a026d392ad551701ad7881e

This was very close to what I need, but every row got the same hex digest regardless of whether the social security numbers differed. I am trying to get a distinct hex digest for each distinct social security number.

Any suggestions?

mfitzp
Dave
  • Do not hash social security numbers and think that it provides *any* sort of obfuscation. The social security number space is tiny, so unsalted hashes of those are trivial for anyone to reverse. If you care about the privacy of the personal information you are hashing, you should at the very least use the hmac module rather than just a straight-up hash. – gps Feb 24 '15 at 01:27
  • Thank you very much for taking the time to respond with this comment! Extremely valued! I did not know that hashes could be reversed. I will look into the hmac module. Again thank you! – Dave Feb 24 '15 at 13:37
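
As a rough illustration of the hmac suggestion in the comment above, here is a minimal sketch using only the standard library. The key value, the function name `keyed_digest`, and the choice of SHA-256 are assumptions made for this example; none of them come from the thread.

```python
import hashlib
import hmac

# Hypothetical secret key for illustration only. In practice it would be
# loaded from secure configuration and never hard-coded.
SECRET_KEY = b"replace-with-a-long-random-secret"

def keyed_digest(value, key=SECRET_KEY):
    """Return an HMAC-SHA256 hex digest of one value under the given key."""
    msg = str(value).encode("utf-8")  # hmac works on bytes, not str or int
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

token_a = keyed_digest(123456789)
token_b = keyed_digest(123456789, key=b"another-key")  # same input, different key
```

The same input always maps to the same digest under one key, so the result can still serve as a join key, but without the key the digests cannot be rebuilt by simply hashing every possible nine-digit number.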

2 Answers

15

hashlib.md5 takes a single value as input (a bytes object on Python 3, a str on Python 2). You can't pass it an array of values as you can with some NumPy/Pandas functions. So instead, you could use a list comprehension to build a list of md5sums, converting each value to bytes first:

ob['md5'] = [hashlib.md5(str(val).encode('utf-8')).hexdigest() for val in ob['ssno']]
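
To make the per-value idea concrete without needing pandas, here is a small standalone sketch. The helper name `md5_hex` and the sample list are hypothetical stand-ins for the `ssno` column. It also shows why the attempt in the question produced one repeated digest: `hashlib.md5('ssno')` hashes the literal text `ssno`, not the column values.

```python
import hashlib

def md5_hex(value):
    """Return the MD5 hex digest of one value (converted to str, then UTF-8 bytes)."""
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()

# Hypothetical sample values standing in for the ssno column.
ssnos = [123456789, 234567891, 891234567, 891234567]

# One digest per value: equal inputs give equal digests, different inputs differ.
digests = [md5_hex(s) for s in ssnos]

# Hashing the literal column name gives one constant digest, which is what
# got assigned to every row in the question.
constant = md5_hex("ssno")
```

With a DataFrame, `ob['md5'] = ob['ssno'].map(md5_hex)` should give the same result as the list comprehension, assuming the column contains no missing values.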
unutbu
  • Absolutely, Beautiful! Makes sense. Thanks for educating me and assisting with a solution! Exactly what I needed! – Dave Feb 23 '15 at 12:55
  • For anyone hitting the 'object supporting the buffer API required' error on this, it can be caused by null (NaN) values in your Pandas series that may need to be processed or removed before hashing. – rocksteady Nov 26 '18 at 22:03
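
One possible way to deal with the missing values mentioned in the comment above is to skip them explicitly. The sketch below uses plain Python; the helper name `md5_hex_or_none` and the sample list are hypothetical. Note that wrapping a NaN in `str()` would avoid the error but silently hash the text `nan`, which is why the check comes first.

```python
import hashlib
import math

def md5_hex_or_none(value):
    """Return the MD5 hex digest of value, or None when the value is missing."""
    # Treat None and float NaN as missing rather than hashing them.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()

# Hypothetical column contents with two missing entries.
values = ["123456789", float("nan"), None, "987654321"]
digests = [md5_hex_or_none(v) for v in values]
```

Whether missing rows should be left empty, dropped, or flagged depends on the dataset; leaving them as `None` is just the assumption made here.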
3

In case you are hashing to SHA256 on Python 3, you'll need to encode each value to bytes first, (probably) as UTF-8:

ob['sha256'] = [hashlib.sha256(str(val).encode('UTF-8')).hexdigest() for val in ob['ssno']]
avibrazil