0

My dataframe looks like this:

col1  col2  col3
Aba   xxx   yyy
bab   bhh   jjj
ccc   kkk   lll
Aba   xxx   yyy
ccc   kkk   jjj

I want to replace unique values of each column with something like: In col1: Aba gets replaced with a0,bab gets replaced with a1, ccc gets replaced with a2 whereever it appears in the column.

Similarly with col2: xxx gets replaced with b0,bhh gets replaced with b1 etc.

In short first column starts replacing unique values with a0,a1,a2,a3 2nd one with b0,b1,b2,b3, third column with c0,c1,c2.. etc .

Looking for a solution in loop so that I can process all the columns and all the unique values all together coz I have more than a million rows.

Thanks,

user1111
  • 321
  • 2
  • 4
  • 16

2 Answers2

3

I'd do it this way:

In [184]: ['a','b','c'] + df.apply(lambda x: pd.factorize(x)[0]).astype(str)
Out[184]:
  col1 col2 col3
0   a0   b0   c0
1   a1   b1   c1
2   a2   b2   c2
3   a0   b0   c0
4   a2   b2   c1

a bit more generic approach:

import string

c = list(string.ascii_lowercase)[:len(df.columns)]
df1 = c + df.apply(lambda x: pd.factorize(x)[0]).astype(str)
df1
  col1 col2 col3
0   a0   b0   c0
1   a1   b1   c1
2   a2   b2   c2
3   a0   b0   c0
4   a2   b2   c1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
0

Here is a numpy solution. It should be efficient as list comprehension often faster than apply + lambda.

Source for alphabet range: Alphabet range python

import pandas as pd
from string import ascii_lowercase

df = pd.DataFrame({'col1': {0: 'Aba', 1: 'bab', 2: 'ccc', 3: 'Aba', 4: 'ccc'},
                   'col2': {0: 'xxx', 1: 'bhh', 2: 'kkk', 3: 'xxx', 4: 'kkk'},
                   'col3': {0: 'yyy', 1: 'jjj', 2: 'lll', 3: 'yyy', 4: 'jjj'}})

a = df.values
f = np.array([np.unique(a[:, i], return_inverse=True)[1] for i in range(a.shape[1])]).T

res = list(ascii_lowercase[:a.shape[1]]) + \
      pd.DataFrame(f.astype(str), columns=df.columns)

#   col1 col2 col3
# 0   a0   b2   c2
# 1   a1   b0   c0
# 2   a2   b1   c1
# 3   a0   b2   c2
# 4   a2   b1   c0
jpp
  • 159,742
  • 34
  • 281
  • 339
  • It gives an error `TypeError: unorderable types: int() < str()` – user1111 Mar 08 '18 at 14:50
  • No, it doesn't. Please see update for complete code. – jpp Mar 08 '18 at 14:53
  • It gives `TypeError: unorderable types: int() < str()` if any of the value in the row is 0. – user1111 Mar 08 '18 at 15:03
  • With `pandas`, you should not use mixed types if possible. Convert your columns to strings via `df[col] = df[col].astype(str)` if necessary. – jpp Mar 08 '18 at 15:05
  • Thanks for your reply anyway. The previous solution from @MaxU worked perfect for my set of data in one line. – user1111 Mar 08 '18 at 15:07
  • Sure, I just wanted to reiterate that "best answer" is subjective. Many 1-line answers in Python are sub-optimal. – jpp Mar 08 '18 at 15:17