I am working with a dataset that contains names as strings. It needs to be published publicly without the original names being visible (i.e. I need to be able to distinguish the different names, but the end result should show something like "e7fx8yuo" where the original dataset had "John Doe").

The requirements sound similar to hashing, but looser in one respect (i.e. I don't need variable-length names to map to a fixed-length hash) and stricter in another: each name must map to a unique string (two different names cannot map to the same string).

I am planning on writing this in Python, but I'm not sure what the process I am looking to implement is called. If possible, I would also like the 'hashed' output string to resemble the repository name suggestions that GitHub generates ("reimagined-memory" instead of "e7fx8yuo"), because a string of complete words is easier to remember. Is there a Python module that can do this for me?

2 Answers

As I said in a comment, this sounds like data masking. Here's a basic implementation:

from collections import defaultdict
from string import ascii_lowercase
from random import choice

random_strings = set()

def random_string():
    while True:
        result = ''.join(choice(ascii_lowercase) for _ in range(8))
        if result not in random_strings:
            random_strings.add(result)
            return result

masks = defaultdict(random_string)

print(masks['Adam'])
print(masks['Adam'])
print(masks['Bob'])

Output:

qmmwavuk
qmmwavuk
ykzlvfaf
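
For the word-style masks the question asks about ("reimagined-memory"), the same pattern could be adapted roughly as follows. This is only a sketch: `ADJECTIVES`, `NOUNS`, `used_masks`, `random_word_mask` and `word_masks` are hypothetical names, and the tiny word lists are placeholders that allow just 36 distinct masks, so a real dataset would need much larger lists.

```python
from collections import defaultdict
from random import choice

# Hypothetical placeholder word lists; a real dataset needs far larger ones.
ADJECTIVES = ["reimagined", "quiet", "amber", "rapid", "gentle", "hollow"]
NOUNS = ["memory", "river", "lantern", "meadow", "harbor", "signal"]

used_masks = set()


def random_word_mask():
    """Return an 'adjective-noun' string that has not been handed out yet."""
    # Without this guard the loop below would never end once every
    # combination has been used.
    if len(used_masks) >= len(ADJECTIVES) * len(NOUNS):
        raise RuntimeError("word lists exhausted; add more words")
    while True:
        result = f"{choice(ADJECTIVES)}-{choice(NOUNS)}"
        if result not in used_masks:
            used_masks.add(result)
            return result


# A missing name triggers random_word_mask(); later lookups reuse the stored mask.
word_masks = defaultdict(random_word_mask)

print(word_masks["John Doe"])
print(word_masks["John Doe"])  # same mask as the line above
print(word_masks["Jane Doe"])  # a different mask
```

The number of available masks is the product of the list lengths, so the lists have to be sized against the number of distinct names.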
Alex Hall
  • That works quite well for me. The only issue is that I sometimes get collisions if the names are relatively long compared to the length of the output strings (8 characters). Is there a definitive rule for how long the output string has to be, relative to the maximum name length, to avoid collisions? – BruceJohnJennerLawso Jul 05 '17 at 00:11
  • @BruceJohnJennerLawso I have no idea what you're talking about. The code is designed not to allow collisions (at least not in the final result) and nothing about the code is related to the lengths of the names. – Alex Hall Jul 05 '17 at 08:17
  • It's having some collisions when applied across very large datasets. Running [this script](https://github.com/BruceJohnJennerLawso/scrap/blob/soupFlow/soup/obfuscant.py) produces collisions, such as: `u'Sherrie Andrew' -> 'hjoczvn' -> u'hardcopyJamsOvertimeConjunctionsZipVendorsNecks'` colliding with `u'Jada Nevaeh' -> 'hjoczvn' -> u'hardcopyJamsOvertimeConjunctionsZipVendorsNecks'` – BruceJohnJennerLawso Jul 05 '17 at 15:44
  • And I would imagine the generator could only be truly unique for names shorter than 8 characters. If every name in the list is 9 characters or longer, there's no way you could map every possible combination of 9 characters to a unique string of 8 characters; it just doesn't work because there aren't enough addresses to go around – BruceJohnJennerLawso Jul 05 '17 at 15:52
  • FYI, that script will take an eternity to run the way I did it, as I set the output key to be 7 characters long and the number of random names generated to be 400000, but you can see something similar if you use length-6 keys and 40000 random unique names as well – BruceJohnJennerLawso Jul 05 '17 at 15:55
  • Updated the original script to use length-6 keys and 40000 random unique names; you'll need to `pip install RandomWords` as well as matplotlib to run it – BruceJohnJennerLawso Jul 05 '17 at 15:58
  • @BruceJohnJennerLawso I see now I forgot a line: `random_strings.add(result)`. That enforces uniqueness. If the keys have length 6 that's 26^6 > 300 million unique keys so there's plenty to go around. – Alex Hall Jul 05 '17 at 16:14
  • Ah, so it just brute-forces uniqueness for keys generated in the same run of the script. Works for me – BruceJohnJennerLawso Jul 05 '17 at 16:17
  • Alex Hall, is there any particular benefit to using a set for random_strings as opposed to a list? – BruceJohnJennerLawso Jul 05 '17 at 16:20
  • Yes, it's much faster: `in` on a set is a hash lookup, while on a list it scans the elements one by one. – Alex Hall Jul 05 '17 at 16:22
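
The collision counts discussed in these comments can be estimated with a little arithmetic. The sketch below assumes keys are drawn independently and uniformly from lowercase letters with no uniqueness check, which is how the code behaves without the `random_strings.add(result)` line; the linked script may differ in its details. `keyspace` and `expected_colliding_pairs` are hypothetical helper names.

```python
from math import comb


def keyspace(length, alphabet_size=26):
    """Number of distinct keys of the given length."""
    return alphabet_size ** length


def expected_colliding_pairs(n_names, length, alphabet_size=26):
    """Expected number of name pairs that share a key, assuming each key is
    drawn independently and uniformly at random with no uniqueness check."""
    return comb(n_names, 2) / keyspace(length, alphabet_size)


print(keyspace(6))  # 308915776, i.e. the "26^6 > 300 million" above

# The two configurations mentioned in the comments.
print(round(expected_colliding_pairs(40_000, 6), 2))
print(round(expected_colliding_pairs(400_000, 7), 2))
```

Under that assumption both configurations are expected to produce a handful of collisions even though the key space is far larger than the number of names, and the name lengths play no part in it. With the uniqueness check in place, collisions cannot occur as long as there are fewer names than possible keys.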

Here is a quick-and-dirty way to do it:


import string
import random


def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    # With no arguments, returns a 6-character string of uppercase letters and digits
    return ''.join(random.choice(chars) for _ in range(size))


def makeID(names):
    nameDict = {}

    for i in names:
        var = id_generator()

        # If the generated ID already exists as a key, loop until we get a unique one
        while var in nameDict:
            var = id_generator()

        # The generated ID is the key; the current name 'i' is the value
        nameDict[var] = i

    print(nameDict)


makeID(['John Doe', 'Jane NoDoe', 'Getsum MoDoe'])


Output:

{'H8WIAP': 'John Doe', '4NT7JC': 'Jane NoDoe', '208DBM': 'Getsum MoDoe'}
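
Note that `makeID` maps ID to name and gives a repeated name a fresh ID each time it appears. For masking a dataset it is usually the reverse lookup that is needed, so one possible variation is sketched below. `make_name_to_id`, `name_to_id` and `used_ids` are hypothetical names and not part of the code above; `id_generator` is repeated so the example runs on its own.

```python
import random
import string


def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    # Same generator as above: a random string of uppercase letters and digits
    return ''.join(random.choice(chars) for _ in range(size))


def make_name_to_id(names):
    """Return a dict mapping each distinct name to its own unique ID."""
    name_to_id = {}
    used_ids = set()

    for name in names:
        # A name seen before keeps the ID it was given the first time
        if name in name_to_id:
            continue

        new_id = id_generator()
        # Retry until the ID has not been handed out yet
        while new_id in used_ids:
            new_id = id_generator()

        used_ids.add(new_id)
        name_to_id[name] = new_id

    return name_to_id


names = ['John Doe', 'Jane NoDoe', 'John Doe']
mapping = make_name_to_id(names)

# Replace every name in the dataset with its mask
print([mapping[name] for name in names])
```

Returning the dictionary instead of printing it lets the caller apply the masks to the data and decide separately whether the mapping should be stored anywhere.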


The random generator comes from "Random string generation with upper case letters and digits in Python".

Khalif Ali