
I want to change values from one column in a dataframe to fake data.

Here is a sample of the original table:

import pandas as pd

df = {'Name': ['David', 'David', 'David', 'Kevin', 'Kevin', 'Ann', 'Joan'],
      'Age': [10, 10, 10, 12, 12, 15, 13]}
df = pd.DataFrame(df)
df

Now what I want to do is to change the Name column values to fake values like this:

df = {'Name': ['A', 'A', 'A', 'B', 'B', 'C', 'D'],
      'Age': [10, 10, 10, 12, 12, 15, 13]}
df = pd.DataFrame(df)
df

Notice how I changed the names to distinct combinations of letters. This is sample data, but the real data has many names, so I start with A, B, C, D, and once the sequence reaches Z, the next new name should be AA, then AB, and so on.

Is this viable?

halfer
Yun Tae Hwang

5 Answers


Here is my suggestion. The list fake below has more than 23,000 items; if your df has more unique values, just increase the end of the loop (currently 5) and the list will grow rapidly:

import string
from itertools import combinations_with_replacement

names = df['Name'].unique()

letters = string.ascii_uppercase

fake = []
for length in range(1, 5):  # increase 5 if you need more codes
    fake.extend(''.join(combo) for combo in combinations_with_replacement(letters, length))

d = dict(zip(names, fake))

df['code'] = df.Name.map(d)

Sample of fake:

>>> print(fake[:30])
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC', 'AD']

Output:

>>> print(df)

    Name  Age code
0  David   10    A
1  David   10    A
2  David   10    A
3  Kevin   12    B
4  Kevin   12    B
5    Ann   15    C
6   Joan   13    D
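One caveat: combinations_with_replacement only yields non-decreasing strings ('AB' appears, but 'BA' never does), so the codes are distinct but skip some labels. If you want the exact A, B, …, Z, AA, AB, … sequence from the question, itertools.product generates it directly — a minimal sketch (the excel_labels helper is illustrative, not a library function):

```python
import string
from itertools import product

def excel_labels(n):
    """Return the first n labels in Excel column order: A..Z, AA, AB, ..."""
    labels = []
    width = 1
    while len(labels) < n:
        # product yields every string of the current width, in order
        for combo in product(string.ascii_uppercase, repeat=width):
            labels.append(''.join(combo))
            if len(labels) == n:
                return labels
        width += 1
    return labels

print(excel_labels(28))  # ['A', 'B', ..., 'Z', 'AA', 'AB']
```

This can be zipped against df['Name'].unique() exactly as in the answer above.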
Ravi
IoaTzimas

Use factorize to turn each fake name into an integer, which is easy to store:

df['Fake']=df.Name.factorize()[0]
df
    Name  Age  Fake
0  David   10     0
1  David   10     0
2  David   10     0
3  Kevin   12     1
4  Kevin   12     1
5    Ann   15     2
6   Joan   13     3

If you need mixed alphanumeric strings instead (note that pd.util.testing is deprecated in recent pandas versions):

df.groupby('Name')['Name'].transform(lambda x : pd.util.testing.rands_array(8,1)[0])
0    jNAO9AdJ
1    jNAO9AdJ
2    jNAO9AdJ
3    es0p4Yjx
4    es0p4Yjx
5    x54NNbdF
6    hTMKxoXW
Name: Name, dtype: object
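If letters are preferred over integers, the factorize codes can be converted with a small bijective-base-26 helper — a sketch (to_label is an illustrative helper, not a pandas function):

```python
import pandas as pd

def to_label(i):
    """Convert a 0-based integer to an Excel-style label: 0 -> 'A', 25 -> 'Z', 26 -> 'AA'."""
    label = ''
    i += 1  # bijective base 26 is easier to compute 1-based
    while i > 0:
        i, rem = divmod(i - 1, 26)
        label = chr(ord('A') + rem) + label
    return label

df = pd.DataFrame({'Name': ['David', 'David', 'Kevin', 'Ann'],
                   'Age': [10, 10, 12, 15]})
# factorize()[0] gives the integer code per row; map each through to_label
df['Fake'] = [to_label(c) for c in df['Name'].factorize()[0]]
print(df['Fake'].tolist())  # ['A', 'A', 'B', 'C']
```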
BENY
from string import ascii_lowercase

def excel_names(num_cols):
    """Generate Excel-style column names: a..z, aa, ab, ..."""
    letters = list(ascii_lowercase)
    excel_cols = []
    for i in range(0, num_cols - 1):
        n = i // 26
        m = n // 26
        i -= n * 26
        n -= m * 26
        col = (letters[m-1] + letters[n-1] + letters[i] if m > 0
               else letters[n-1] + letters[i] if n > 0
               else letters[i])
        excel_cols.append(col)
    return excel_cols


unique_names = df['Name'].nunique() + 1
names = excel_names(unique_names)
dictionary = dict(zip(df['Name'].unique(), names))
df['new_Name'] = df['Name'].map(dictionary)
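Here is a self-contained check of this approach on the sample frame. Note the labels come out lowercase, since the helper uses ascii_lowercase, and the two- and three-letter branches must index with letters[n-1] (not letters[n1]):

```python
import pandas as pd
from string import ascii_lowercase

def excel_names(num_cols):
    """Self-contained copy of the helper, with the letters[n-1] index fix."""
    letters = list(ascii_lowercase)
    excel_cols = []
    for i in range(0, num_cols - 1):
        n = i // 26
        m = n // 26
        i -= n * 26
        n -= m * 26
        col = (letters[m-1] + letters[n-1] + letters[i] if m > 0
               else letters[n-1] + letters[i] if n > 0
               else letters[i])
        excel_cols.append(col)
    return excel_cols

df = pd.DataFrame({'Name': ['David', 'David', 'Kevin', 'Ann', 'Joan'],
                   'Age': [10, 10, 12, 15, 13]})
names = excel_names(df['Name'].nunique() + 1)
df['new_Name'] = df['Name'].map(dict(zip(df['Name'].unique(), names)))
print(df['new_Name'].tolist())  # ['a', 'a', 'b', 'c', 'd']
```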
Ravi
  • excel_names reference from https://stackoverflow.com/questions/56452581/continous-alphabetic-list-in-python-and-getting-every-value-of-it – Ravi Dec 16 '20 at 20:01

Get a new integer category for the names using cumsum, then use Python's ord and chr to turn the integers into strings starting from 'A':

df['Name'] = (~(df.Name.shift(1) == df.Name)).cumsum().add(ord('A') - 1).map(chr)
print(df)



   Name  Age
0    A   10
1    A   10
2    A   10
3    B   12
4    B   12
5    C   15
6    D   13
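One caveat: the shift/cumsum trick assigns a fresh letter every time the name changes, so a name that reappears after a different one gets a second code. If identical names may not be grouped consecutively, groupby(...).ngroup() numbers them by first appearance instead — a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['David', 'Kevin', 'David'], 'Age': [10, 12, 10]})
# ngroup with sort=False numbers each distinct name in order of first appearance
codes = df.groupby('Name', sort=False).ngroup()
df['Name'] = (codes + ord('A')).map(chr)
print(df['Name'].tolist())  # ['A', 'B', 'A'] -- the repeated David keeps its code
```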
wwnde

Let us think about it another way. If you just need a fake symbol, you can map the names to A0, A1, A2, …, An; this is easier.

import pandas as pd

df = {'Name': ['David', 'David', 'David', 'Kevin', 'Kevin', 'Ann', 'Joan'], 'Age': [10, 10, 10, 12, 12, 15, 13]}
df = pd.DataFrame(df)
mapping = pd.DataFrame({'name': df['Name'].unique()})  # avoid shadowing the built-in map
mapping['seq'] = mapping.index
mapping['symbol'] = mapping['seq'].apply(lambda x: 'A' + str(x))
df['code'] = df['Name'].apply(lambda x: mapping.loc[mapping['name'] == x, 'symbol'].values[0])
df

    Name  Age code
0  David   10   A0
1  David   10   A0
2  David   10   A0
3  Kevin   12   A1
4  Kevin   12   A1
5    Ann   15   A2
6   Joan   13   A3
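The per-row .loc lookup above scans the mapping frame once for every row; the same A0, A1, … symbols can be produced with a single dict and Series.map — a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['David', 'David', 'Kevin', 'Ann', 'Joan'],
                   'Age': [10, 10, 12, 15, 13]})
# one symbol per distinct name, in order of first appearance
symbols = {name: f'A{i}' for i, name in enumerate(df['Name'].unique())}
df['code'] = df['Name'].map(symbols)
print(df['code'].tolist())  # ['A0', 'A0', 'A1', 'A2', 'A3']
```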
Nour-Allah Hussein