Pandas dataframe replace unique values in a column

Question

My dataframe looks like this:

col1  col2  col3
Aba   xxx   yyy
bab   bhh   jjj
ccc   kkk   lll
Aba   xxx   yyy
ccc   kkk   jjj

I want to replace unique values of each column with something like: In col1: Aba gets replaced with a0,bab gets replaced with a1, ccc gets replaced with a2 whereever it appears in the column.

Similarly with col2: xxx gets replaced with b0,bhh gets replaced with b1 etc.

In short first column starts replacing unique values with a0,a1,a2,a3 2nd one with b0,b1,b2,b3, third column with c0,c1,c2.. etc .

Looking for a solution in loop so that I can process all the columns and all the unique values all together coz I have more than a million rows.

Thanks,

score 3 · Accepted Answer · edited Mar 08 '18 at 13:45

3

I'd do it this way:

In [184]: ['a','b','c'] + df.apply(lambda x: pd.factorize(x)[0]).astype(str)
Out[184]:
  col1 col2 col3
0   a0   b0   c0
1   a1   b1   c1
2   a2   b2   c2
3   a0   b0   c0
4   a2   b2   c1

a bit more generic approach:

import string

c = list(string.ascii_lowercase)[:len(df.columns)]
df1 = c + df.apply(lambda x: pd.factorize(x)[0]).astype(str)
df1
  col1 col2 col3
0   a0   b0   c0
1   a1   b1   c1
2   a2   b2   c2
3   a0   b0   c0
4   a2   b2   c1

edited Mar 08 '18 at 13:45

jezrael

822,522
95
1,334
1,252

answered Mar 08 '18 at 13:27

MaxU - stand with Ukraine

205,989
36
386
419

@jezrael, thank you! I didn't get it - what "bar" do you mean? – MaxU - stand with Ukraine Mar 08 '18 at 13:42
1

I modify answer for remove it ;) – jezrael Mar 08 '18 at 13:45
@MaxU - The best and the smartest solution! Thank you so much :) – user1111 Mar 08 '18 at 13:45
@jezrael, ah, i got it now - you were talking about iPython prompt... Thanks for editing the answer !:-) – MaxU - stand with Ukraine Mar 08 '18 at 13:46
@user1111, glad i could help! :-) – MaxU - stand with Ukraine Mar 08 '18 at 13:47
1

No, I think html bar if long line in answer ;) – jezrael Mar 08 '18 at 13:47
@user1111, while this solution is very good, I'm not convinced it's the best, certainly not from an efficiency perspective (using `apply` to loop columns). I've added another solution so at least you can compare. – jpp Mar 08 '18 at 14:06

jpp · Answer 2 · 2018-03-08T14:53:05.420

0

Here is a numpy solution. It should be efficient as list comprehension often faster than apply + lambda.

Source for alphabet range: Alphabet range python

import pandas as pd
from string import ascii_lowercase

df = pd.DataFrame({'col1': {0: 'Aba', 1: 'bab', 2: 'ccc', 3: 'Aba', 4: 'ccc'},
                   'col2': {0: 'xxx', 1: 'bhh', 2: 'kkk', 3: 'xxx', 4: 'kkk'},
                   'col3': {0: 'yyy', 1: 'jjj', 2: 'lll', 3: 'yyy', 4: 'jjj'}})

a = df.values
f = np.array([np.unique(a[:, i], return_inverse=True)[1] for i in range(a.shape[1])]).T

res = list(ascii_lowercase[:a.shape[1]]) + \
      pd.DataFrame(f.astype(str), columns=df.columns)

#   col1 col2 col3
# 0   a0   b2   c2
# 1   a1   b0   c0
# 2   a2   b1   c1
# 3   a0   b2   c2
# 4   a2   b1   c0

edited Mar 08 '18 at 14:53

answered Mar 08 '18 at 13:49

jpp

159,742
34
281
339

It gives an error `TypeError: unorderable types: int() < str()` – user1111 Mar 08 '18 at 14:50
No, it doesn't. Please see update for complete code. – jpp Mar 08 '18 at 14:53
It gives `TypeError: unorderable types: int() < str()` if any of the value in the row is 0. – user1111 Mar 08 '18 at 15:03
With `pandas`, you should not use mixed types if possible. Convert your columns to strings via `df[col] = df[col].astype(str)` if necessary. – jpp Mar 08 '18 at 15:05
Thanks for your reply anyway. The previous solution from @MaxU worked perfect for my set of data in one line. – user1111 Mar 08 '18 at 15:07
Sure, I just wanted to reiterate that "best answer" is subjective. Many 1-line answers in Python are sub-optimal. – jpp Mar 08 '18 at 15:17

Pandas dataframe replace unique values in a column

2 Answers2