6

Let us say I have the following simple data frame. But in reality, I have hundreds thousands of rows like this.

df

ID              Sales
倀굖곾ꆹ譋῾理     100
倀굖곾ꆹ         50
倀굖곾ꆹ譋῾理     70
곾ꆹ텊躥㫆        60

My idea is that I want to replace the Chinese digit with randomly generated 8 digits something looks like below.

ID              Sales
13434535        100
67894335         50
13434535         70
10986467         60

The digits are randomly generated but they should keep uniqueness as well. For example, row 0 and 2 are same and when it replaced by a random unique ID, it should be the same as well.

Can anyone help on this in Python pandas? Any solution that is already done before is also welcome.

Hiwot
  • 568
  • 5
  • 18

4 Answers4

11

The primary method here will be to use Series.map() on the 'ID's to assign the new values.

Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.

which is exactly what you're looking for.

Here are some options for generating the new IDs:

1. Randomly generated 8-digit integers, as asked

You can first create a map of randomly generated 8-digit integers with each of the unique ID's in the dataframe. Then use Series.map() on the 'ID's to assign the new values back. I've included a while loop to ensure that the generated ID's are unique.

import random

original_ids = df['ID'].unique()
while True:
    new_ids = {id_: random.randint(10_000_000, 99_999_999) for id_ in original_ids}
    if len(set(new_ids.values())) == len(original_ids):
        # all the generated id's were unique
        break
    # otherwise this will repeat until they are

df['ID'] = df['ID'].map(new_ids)

Output:

         ID  Sales
0  91154173    100
1  27127403     50
2  91154173     70
3  55892778     60

Edit & Warning: The original ids are Chinese characters and they are already length 8. There's definitely more than 10 Chinese characters so with the wrong combination of original IDs, it could become impossible to make unique-enough 8-digit IDs for the new set. Unless you are memory bound, I'd recommend using 16-24 digits. Or even better...

2. Use UUIDs. [IDEAL]

You can still use the "integer" version of the ID instead of hex. This has the added benefit of not needing to check for uniqueness:

import uuid

original_ids = df['ID'].unique()
new_ids = {cid: uuid.uuid4().int for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)

(If you are okay with hex id's, change uuid.uuid4().int above to uuid.uuid4().hex.)

Output:

                                        ID  Sales
0   10302456644733067873760508402841674050    100
1   99013251285361656191123600060539725783     50
2   10302456644733067873760508402841674050     70
3  112767087159616563475161054356643068804     60

2.B. Smaller numbers from UUIDs

If the ID generated above is too long, you could truncate it, with some minor risk. Here, I'm only using the first 16 hex characters and converting those to an int. You may put that in the uniqueness loop check as done for option 1, above.

import uuid

original_ids = df['ID'].unique()
DIGITS = 16  # number of hex digits of the UUID to use
new_ids = {cid: int(uuid.uuid4().hex[:DIGITS], base=16) for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)

Output:

                     ID  Sales
0  14173925717660158959    100
1  10599965012234224109     50
2  14173925717660158959     70
3  13414338319624454663     60

3. Creating a mapping based on the actual value:

This group of options has these advantages:

  • not needing a uniqueness check since it's deterministically based on the original ID and
    • So original IDs which were the same will generate the same new ID
  • doesn't need a map created in advance

3.A. CRC32

(Higher probability of finding a collision with different IDs, compared to option 2.B. above.)

import zlib

df['ID'] = df['ID'].map(lambda cid: zlib.crc32(bytes(cid, 'utf-8')))

Output:

           ID  Sales
0  2083453980    100
1  1445801542     50
2  2083453980     70
3   708870156     60

3.B. Python's built-in hash() of the orignal ID [My preferred approach in this scenario]

  • Can be done in one line, no imports needed
  • Reasonably secure to not generate collisions for IDs which are different
df['ID'] = df['ID'].map(hash)

Output:

                    ID  Sales
0  4663892623205934004    100
1  1324266143210735079     50
2  4663892623205934004     70
3  6251873913398988390     60

3.C. MD5Sum, or anything from hashlib

Since the IDs are expected to be small (8 chars), even with MD5, the probability of a collision is very low.

import hashlib

DIGITS = 16  # number of hex digits of the hash to use
df['ID'] = df['ID'].str.encode('utf-8').map(lambda x: int(hashlib.md5(x).hexdigest()[:DIGITS], base=16))

Output:

                     ID  Sales
0  17469287633857111608    100
1   4297816388092454656     50
2  17469287633857111608     70
3  11434864915351595420     60
aneroid
  • 12,983
  • 3
  • 36
  • 66
  • Also probably as it is important data we don't want mistakes, so would be great to control uniqueness of generated random ids. If there are thousands of rows it could happen rarely that `randint()` will generate twice same id. – Arty Feb 25 '21 at 10:44
  • The solution is good, but there is a very low probability of the mapped id non being unique. – tino Feb 25 '21 at 10:46
  • @Arty Good point, I'll update the answer. – aneroid Feb 25 '21 at 10:47
  • @aneroid, I have checked it. It couldn't keep the uniqueness during mapping. Can modify it a bit? – Hiwot Feb 25 '21 at 10:51
  • 1
    @Hiwot See the update posted seconds before your comment. The downside is that your program will go into an infinite loop until unique ids are found... – aneroid Feb 25 '21 at 10:52
  • @Hiwot To avoid long loop waiting for uniqueness you can just use 9 digits ids instead of 8 digits, this will converge much faster. – Arty Feb 25 '21 at 10:54
  • @Hiwot Made some edits to include safer mapping options. I'd recommend option **2b** above. – aneroid Feb 25 '21 at 11:33
  • @Arty Agreed about 8 digits being too short. But by that logic, even 9 is not safe. (12 might work with some luck). I've added a warning about that and some better options. – aneroid Feb 25 '21 at 12:01
  • @aneroid Non-uniqueness of random numbers can be solved differently. You generate a bit more random numbers in total, e.g. 10% more than needed. Then you remove duplicates. That's it, now with very high chance you'll have all ids unique even with 8 digits. – Arty Feb 25 '21 at 12:10
  • @Hiwot Updated with more options. Personally, I prefer option 3.B. using Python's builtin [`hash()`](https://docs.python.org/3/library/functions.html#hash) of the orignal ID. – aneroid Feb 26 '21 at 21:27
4

Not very expert in Pandas, that's why implementing solution for you with Numpy + Pandas. As solution uses fast Numpy it means it will be much faster than pure Python solution especially if you have thousands of rows.

Try it online!

import pandas as pd, numpy as np
df = pd.DataFrame([
    ['倀굖곾ꆹ譋῾理', 100],
    ['倀굖곾ꆹ', 50],
    ['倀굖곾ꆹ譋῾理', 70],
    ['곾ꆹ텊躥㫆', 60],
], columns = ['ID', 'Sales'])
u, iv = np.unique(df.ID.values, return_inverse = True)
while True:
    ids = np.random.randint(10 ** 7, 10 ** 8, u.size)
    if np.all(np.unique(ids, return_counts = True)[1] <= 1):
        break
df.ID = ids[iv]
print(df)

Output:

         ID  Sales
0  31043191    100
1  36168634     50
2  31043191     70
3  17162753     60
Arty
  • 14,883
  • 6
  • 36
  • 69
2

I would:

  • identify the unique ID values
  • build (from np.random) an array of unique values of same size
  • build a tranformation dataframe with that array
  • use merge to replace the original ID values

Possible code:

trans = df[['ID']].drop_duplicates()        # unique ID values
n = len(trans)

# np.random.seed(0)       # uncomment for reproducible pseudo random sequences

while True:
    # build a greater array to have a higher chance to get enough unique values
    arr = np.unique(np.random.randint(10000000, 100000000, n + n // 2))
    if len(arr) >= n:
        arr = arr[:n]             # ok keep only the required number
        break

trans['new'] = arr                # ok we have our transformation table

df['ID'] = df.merge(trans, how='left', on='ID')['new']   # done...

With your sample data (and with np.random.seed(0)), it gives:

         ID  Sales
0  12215104    100
1  48712131     50
2  12215104     70
3  70969723     60

Per @Arty's comment, np.unique will return a ascending sequence. If you do not want that, shuffle it before using it for the transformation table:

...
np.random.shuffle(arr)
trans['new'] = arr
...
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
  • BTW, `np.random.randint()` doesn't include upper bound so your upper bound should be `10 ** 8`. This is difference from standard python's `random.randint()` which includes upper bound. – Arty Feb 25 '21 at 10:48
  • @Arty I have edited my post from your comment. Interestingly, it did not change the values... – Serge Ballesta Feb 25 '21 at 10:51
  • Also, `np.unique()` returns sorted list of integers. Don't know if it is issue or not, but in this case chinese ids will be replaced by increasing random ids. They are not just to random, don't know if it matters. – Arty Feb 25 '21 at 10:52
  • @Arty's Thank you for your comment, because I was not aware of that. I have edited my post... – Serge Ballesta Feb 25 '21 at 10:58
1

Given a dataframe df, create a list of the ids:

id_list = list(df.ID)

Then import the random package

from random import randint
from collections import deque

def idSetToNumber(id_list):
    id_set = deque(set(id_list))
    checked_numbers = []
    while len(id_set)>0:
        #get the id
        id = randint(10000000,99999999)
        #check if the id has been used
        if id not in checked_numbers:
            checked_numbers.append(id)
            id_set.popleft()
    return checked_numbers

This gives a list of unique 8-digit number for each of your keys. Then create a dictionary

checked_numbers = idSetToNumber(id_list)
name2id = {}
for i in range(len(checked_numbers)):
    name2id[id_list[i]]=checked_numbers[i]

Last step, replace all the pandas ID fields with the ones in the dictionary.

for i in range(df.shape[0]):
    df.ID[i] = str(name2id[df.ID[i]])
    
tino
  • 160
  • 6