I have a large data set with many rows. One column in it holds long string values, and I want to replace those values automatically with shorter names in pandas. How can I do that?

My data is something like this:

[image: sample input data]

and I want an output like this:

[image: desired output]

Henry Ecker
Me0002

2 Answers

3

What you are looking for is the pd.factorize function, which encodes each distinct value as an integer code (0, 1, 2, ... in order of first appearance). Adding 1 and prefixing 'C' turns those codes into short labels. You can use it as follows:

df['Col1'] = 'C' + pd.Series(pd.factorize(df['Col1'])[0] + 1, dtype='string')

or, if your pandas version does not support the 'string' dtype (it was added in pandas 1.0), use:

df['Col1'] = 'C' + pd.Series(pd.factorize(df['Col1'])[0] + 1).astype(str) 
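To make the one-liner above easier to follow, here is a minimal sketch of what pd.factorize returns and how the short labels can be mapped back to the original values. The sample values and the names `labels` and `short_to_long` are hypothetical and not part of the original answer.

```python
import pandas as pd

# Hypothetical long values standing in for the real column.
s = pd.Series(["alpha_long", "beta_long", "alpha_long", "gamma_long"])

# factorize returns (codes, uniques): one integer code per row, plus the
# distinct values in order of first appearance.
codes, uniques = pd.factorize(s)
print(codes.tolist())   # [0, 1, 0, 2]
print(list(uniques))    # ['alpha_long', 'beta_long', 'gamma_long']

# Short labels C1, C2, ... built from the codes.
labels = ["C" + str(code + 1) for code in codes]

# Lookup table from each short label back to the original long value.
short_to_long = {f"C{i + 1}": value for i, value in enumerate(uniques)}
print(short_to_long["C3"])  # gamma_long
```

Keeping a lookup such as `short_to_long` may be worthwhile if the original values are needed later. Note that missing values get the code -1 by default, so they would come out as "C0" with this scheme.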

Demo

Data Input

import pandas as pd

data = {'Col1': ['XXXXXXXXXXXXXX', 'YYYYYYYYYYYYYY', 'XXXXXXXXXXXXXX', 'YYYYYYYYYYYYYY', 'XXXXXXXXXXXXXX', 'ZZZZZZZZZZZZZZ']}
df = pd.DataFrame(data)

print(df) 


             Col1
0  XXXXXXXXXXXXXX
1  YYYYYYYYYYYYYY
2  XXXXXXXXXXXXXX
3  YYYYYYYYYYYYYY
4  XXXXXXXXXXXXXX
5  ZZZZZZZZZZZZZZ

Output:

print(df)

  Col1
0   C1
1   C2
2   C1
3   C2
4   C1
5   C3
SeaBean
  • 2
    It is also possible to set the dtype instead of making another copy with `astype`. `df['Col1'] = 'C' + pd.Series(pd.factorize(df['Col1'])[0] + 1, dtype='string')` – Henry Ecker Oct 09 '21 at 18:59
  • 1
    I got the error "the data type string is not understood" with this code, so I changed it to df['Col1'] = 'C' + pd.Series(pd.factorize(df['Col1'])[0] + 1).astype(str) and it works perfectly. Thanks @SeaBean for your help – Me0002 Oct 10 '21 at 09:36
0

Use:

df['Col1'] = 'C' + (df.groupby('Col1').ngroup() + 1).astype(str)
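One difference from the factorize approach may matter: by default groupby sorts its keys, so ngroup numbers the groups in sorted order rather than in order of first appearance. The sketch below uses hypothetical data, chosen so that the two orders differ, and shows that passing sort=False appears to reproduce the factorize numbering.

```python
import pandas as pd

# Hypothetical data in which the first value seen is not the smallest one.
df = pd.DataFrame({"Col1": ["ZZZ", "XXX", "ZZZ", "YYY"]})

# Default groupby (sort=True): groups are numbered in sorted key order.
sorted_ids = df.groupby("Col1").ngroup() + 1
# sort=False: groups are numbered in order of first appearance.
seen_ids = df.groupby("Col1", sort=False).ngroup() + 1
# factorize also numbers values in order of first appearance.
fact_ids = pd.factorize(df["Col1"])[0] + 1

sorted_labels = ("C" + sorted_ids.astype(str)).tolist()
seen_labels = ("C" + seen_ids.astype(str)).tolist()

print(sorted_labels)  # ['C3', 'C1', 'C3', 'C2']
print(seen_labels)    # ['C1', 'C2', 'C1', 'C3']
```

Either numbering gives each distinct value a unique short name. The choice only matters if the labels should follow the order in which values occur in the data.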
Muhammad Hassan