
pandas.factorize encodes input values as an enumerated type or categorical variable.

But how can I easily and efficiently convert many columns of a data frame? What about the reverse mapping step?

Example: This data frame contains columns with string values such as "type 2" which I would like to convert to numerical values - and possibly translate them back later.


clstaudt

3 Answers


You can use apply if you need to factorize each column separately:

df = pd.DataFrame({'A':['type1','type2','type2'],
                   'B':['type1','type2','type3'],
                   'C':['type1','type3','type3']})

print (df)
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3

print (df.apply(lambda x: pd.factorize(x)[0]))
   A  B  C
0  0  0  0
1  1  1  1
2  1  2  1
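Since pd.factorize also returns the uniques per column, you can keep them to translate the codes back later. A small sketch building on the example above (the `codes`/`uniques` dicts are illustrative names, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# factorize returns (codes, uniques); keeping the uniques per column
# allows translating the codes back later
codes = {}
uniques = {}
for col in df:
    codes[col], uniques[col] = pd.factorize(df[col])

encoded = pd.DataFrame(codes)
print(encoded)
#    A  B  C
# 0  0  0  0
# 1  1  1  1
# 2  1  2  1

# reverse: index each uniques array with the codes
restored = pd.DataFrame({col: uniques[col][codes[col]] for col in df})
print(restored.equals(df))  # True
```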

If you need the same numeric value for the same string across all columns:

print (df.stack().rank(method='dense').unstack())
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0

If you need to apply the function only for some columns, use a subset:

df[['B','C']] = df[['B','C']].stack().rank(method='dense').unstack()
print (df)
       A    B    C
0  type1  1.0  1.0
1  type2  2.0  3.0
2  type2  3.0  3.0

Solution with factorize:

stacked = df[['B','C']].stack()
df[['B','C']] = pd.Series(stacked.factorize()[0], index=stacked.index).unstack()
print (df)
       A  B  C
0  type1  0  0
1  type2  1  2
2  type2  2  2
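An alternative sketch (not from the original answer) for the same shared encoding of a subset: build one `CategoricalDtype` from the stacked values so `B` and `C` get consistent codes via `cat.codes`:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

# one dtype shared by both columns, so identical strings
# get identical codes in B and C
shared = CategoricalDtype(categories=sorted(df[['B', 'C']].stack().unique()))
for col in ['B', 'C']:
    df[col] = df[col].astype(shared).cat.codes

print(df)
#        A  B  C
# 0  type1  0  0
# 1  type2  1  2
# 2  type2  2  2
```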

Translating them back is possible via map with a dict, where you first need to remove duplicates with drop_duplicates:

vals = df.stack().drop_duplicates().values
b = df.stack().drop_duplicates().rank(method='dense').tolist()

d1 = dict(zip(b, vals))
print (d1)
{1.0: 'type1', 2.0: 'type2', 3.0: 'type3'}

df1 = df.stack().rank(method='dense').unstack()
print (df1)
     A    B    C
0  1.0  1.0  1.0
1  2.0  2.0  3.0
2  2.0  3.0  3.0

print (df1.stack().map(d1).unstack())
       A      B      C
0  type1  type1  type1
1  type2  type2  type3
2  type2  type3  type3
jezrael
  • And if I do not want to apply the function to every column, just to a list of columns? – clstaudt Sep 08 '16 at 12:01
  • You can use subset, give me a sec. – jezrael Sep 08 '16 at 12:02
  • If I have a large dataset (100 categorical values and 50 numerical values), how can I apply your last method to translate them back, just for the categorical values? I mean, just encode the categorical variables and do not touch the other variables. – Ib D Jan 28 '19 at 11:49
  • @IbD - How is it possible to distinguish categorical columns? Like `df = df.select_dtypes(object)`? – jezrael Jan 28 '19 at 11:53
  • Yes, on my side I have something like: `char_cols = df.dtypes.pipe(lambda x: x[x == 'object']).index` and then `for c in char_cols: df[c] = pd.factorize(df[c], na_sentinel = 1)[0]` – Ib D Jan 28 '19 at 13:00
  • The target is to separate the categorical columns (dtype = object) from the others and to do the same job as the last part of your answer. – Ib D Jan 28 '19 at 13:15

I also found this answer quite helpful: https://stackoverflow.com/a/20051631/4643212

I was trying to take values from an existing column in a Pandas DataFrame (a list of IP addresses named 'SrcIP') and map them to numerical values in a new column (named 'ID' in this example).

Solution:

df['ID'] = pd.factorize(df.SrcIP)[0]

Result:

        SrcIP | ID    
192.168.1.112 |  0  
192.168.1.112 |  0  
192.168.4.118 |  1 
192.168.1.112 |  0
192.168.4.118 |  1
192.168.5.122 |  2
192.168.5.122 |  2
...
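To translate the IDs back to addresses, the second return value of pd.factorize can be indexed with the codes. A small sketch with made-up data matching the table above:

```python
import pandas as pd

df = pd.DataFrame({'SrcIP': ['192.168.1.112', '192.168.1.112',
                             '192.168.4.118', '192.168.1.112',
                             '192.168.4.118', '192.168.5.122',
                             '192.168.5.122']})

# factorize returns (codes, uniques)
codes, uniques = pd.factorize(df.SrcIP)
df['ID'] = codes
print(df['ID'].tolist())  # [0, 0, 1, 0, 1, 2, 2]

# reverse mapping: index the uniques with the codes
df['SrcIP_restored'] = uniques[codes]
print((df['SrcIP_restored'] == df['SrcIP']).all())  # True
```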
Gabe F.

I would like to redirect to my other answer: https://stackoverflow.com/a/32011969/1694714

Old answer

Another readable solution for this problem, when you want to keep the categories consistent across the resulting DataFrame, is to use replace:

def categorise(df):
    categories = {k: v for v, k in enumerate(df.stack().unique())}
    return df.replace(categories)

It performs slightly worse than the example by @jezrael, but is easier to read. Also, it might scale better for bigger datasets. I can do some proper testing if anyone is interested.
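A quick usage sketch (the data and the `inverse` dict are illustrative, not from the answer); flipping the same dict gives the reverse translation:

```python
import pandas as pd

def categorise(df):
    categories = {k: v for v, k in enumerate(df.stack().unique())}
    return df.replace(categories)

df = pd.DataFrame({'A': ['type1', 'type2', 'type2'],
                   'B': ['type1', 'type2', 'type3'],
                   'C': ['type1', 'type3', 'type3']})

encoded = categorise(df)
print(encoded)
#    A  B  C
# 0  0  0  0
# 1  1  1  2
# 2  1  2  2

# inverse: flip the dict and replace back
categories = {k: v for v, k in enumerate(df.stack().unique())}
inverse = {v: k for k, v in categories.items()}
print(encoded.replace(inverse).equals(df))  # True
```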

tbrittoborges