Convert categorical data in pandas dataframe

Question

I have a dataframe with this type of data (too many columns):

col1        int64
col2        int64
col3        category
col4        category
col5        category

Columns look like this:

Name: col3, dtype: category
Categories (8, object): [B, C, E, G, H, N, S, W]

I want to convert all the values in each column to integer like this:

[1, 2, 3, 4, 5, 6, 7, 8]

I solved this for one column by this:

dataframe['c'] = pandas.Categorical.from_array(dataframe.col3).codes

Now I have two columns in my dataframe - old col3 and new c and need to drop old columns.

That's bad practice. It works but in my dataframe there are too many columns and I don't want do it manually.

How can I do this more cleverly?

score 231 · Accepted Answer · answered Aug 14 '15 at 14:01

231

First, to convert a Categorical column to its numerical codes, you can do this easier with: dataframe['c'].cat.codes.
Further, it is possible to select automatically all columns with a certain dtype in a dataframe using select_dtypes. This way, you can apply above operation on multiple and automatically selected columns.

First making an example dataframe:

In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),  'col3':list('ababb')})

In [76]: df['col2'] = df['col2'].astype('category')

In [77]: df['col3'] = df['col3'].astype('category')

In [78]: df.dtypes
Out[78]:
col1       int64
col2    category
col3    category
dtype: object

Then by using select_dtypes to select the columns, and then applying .cat.codes on each of these columns, you can get the following result:

In [80]: cat_columns = df.select_dtypes(['category']).columns

In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')

In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

In [84]: df
Out[84]:
   col1  col2  col3
0     1     0     0
1     2     1     1
2     3     2     0
3     4     0     1
4     5     1     1

answered Aug 14 '15 at 14:01

joris

133,120
36
247
202

20

is there a easy way we get a mapping between category code and category string values? – Allan Ruin Jul 28 '16 at 08:51
7

You can use: `df['col2'].cat.categories` for instance. – ogrisel Oct 08 '16 at 13:56
15

Pointing out for anyone concerned that this will map `NaN`'s uniquely to `-1` – crackpotHouseplant Apr 07 '17 at 23:44
2

Love the 2 liners ;) – Jose A Jul 18 '18 at 14:09
2

Watch out that if the categorical is ordered (an ordinal) then the numerical codes returned by `cat.codes` may NOT be the ones you see in the Series ! – paulperry Feb 07 '19 at 22:09
In this example `col1` and `col2` will be the same and this may NOT be what you want: ```df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':[5,4,3,2,1], 'col3':list('ababb')}) df.col1 = df.col1.astype('category', ordered=False) df.col2 = df.col2.astype(pd.api.types.CategoricalDtype(categories=[5,4,3,2,1], ordered=True))``` – paulperry Feb 07 '19 at 22:26
The code in my answer works exactly the same for ordered or non-ordered categoricals. – joris Feb 08 '19 at 07:23
I tried to convert directly one column with _df[''col2'] = _df['col2'].apply(lambda x: x.cat.codes) but I get the error `AttributeError: 'str' object has no attribute 'astype' ` even though the column is a categorical type. – Zioalex Jul 15 '20 at 16:11

score 82 · Answer 2 · answered Sep 12 '17 at 23:20

82

This works for me:

pandas.factorize( ['B', 'C', 'D', 'B'] )[0]

Output:

[0, 1, 2, 0]

answered Sep 12 '17 at 23:20

scottlittle

18,866
8
51
70

5

underrated answer – aryanknp Jan 07 '21 at 13:04
7

great, much simpler than the accepted answer – Moritz Jan 19 '21 at 18:12
4

I agree, this is a very good and efficient answer – Laurent Apr 09 '21 at 07:34
4

best answer, imho – Lécio Bourbon Dec 22 '21 at 12:48
1

While this solves the problem, you should prefer the accessor `pd.Series.cat.codes` over `pd.factorize` for performance reasons. Internally, a categorical is already a list of indices and extracting that via `.cat.codes` takes `O(0)` time, whereas (re-)factorizing via `pd.factorize` takes `O(n)` time ([currently](https://github.com/pandas-dev/pandas/blob/5d17d73be969a7d004a7e1035afbf811a8fbf18a/pandas/core/algorithms.py#L634-L812) no fast-path for categorical). – FirefoxMetzger Feb 08 '23 at 09:26
Then, nowadays what should we use, pd.factorize(), astype('category') or pd.Categorical()? – skan Jul 26 '23 at 11:56

score 26 · Answer 3 · answered May 27 '17 at 05:56

If your concern was only that you making a extra column and deleting it later, just dun use a new column at the first place.

dataframe = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),  'col3':list('ababb')})
dataframe.col3 = pd.Categorical.from_array(dataframe.col3).codes

You are done. Now as Categorical.from_array is deprecated, use Categorical directly

dataframe.col3 = pd.Categorical(dataframe.col3).codes

If you also need the mapping back from index to label, there is even better way for the same

dataframe.col3, mapping_index = pd.Series(dataframe.col3).factorize()

check below

print(dataframe)
print(mapping_index.get_loc("c"))

score 17 · Answer 4 · answered Feb 15 '18 at 06:07

Here multiple columns need to be converted. So, one approach i used is ..

for col_name in df.columns:
    if(df[col_name].dtype == 'object'):
        df[col_name]= df[col_name].astype('category')
        df[col_name] = df[col_name].cat.codes

This converts all string / object type columns to categorical. Then applies codes to each type of category.

score 16 · Answer 5 · answered Jul 06 '20 at 10:32

16

What I do is, I replace values.

Like this-

df['col'].replace(to_replace=['category_1', 'category_2', 'category_3'], value=[1, 2, 3], inplace=True)

In this way, if the col column has categorical values, they get replaced by the numerical values.

answered Jul 06 '20 at 10:32

truth

329
3
14

score 8 · Answer 6 · answered Jan 09 '20 at 06:52

For converting categorical data in column C of dataset data, we need to do the following:

from sklearn.preprocessing import LabelEncoder 
labelencoder= LabelEncoder() #initializing an object of class LabelEncoder
data['C'] = labelencoder.fit_transform(data['C']) #fitting and transforming the desired categorical column.

score 8 · Answer 7 · answered Mar 17 '21 at 16:23

8

To convert all the columns in the Dataframe to numerical data:

df2 = df2.apply(lambda x: pd.factorize(x)[0])

answered Mar 17 '21 at 16:23

sauravjoshi23

837
11
9

what is I have a column with 0/1/2 and I want to transform it into 'A', 'B', 'C' ? – seralouk Oct 31 '22 at 10:36

score 5 · Answer 8 · answered Dec 05 '20 at 00:03

5

Answers here seem outdated. Pandas now has a factorize() function and you can create categories as:

df.col.factorize()

Function signature:

pandas.factorize(values, sort=False, na_sentinel=- 1, size_hint=None)

answered Dec 05 '20 at 00:03

Hamza

5,373
3
28
43

score 3 · Answer 9 · answered Sep 28 '20 at 08:46

One of the simplest ways to convert the categorical variable into dummy/indicator variables is to use get_dummies provided by pandas. Say for example we have data in which sex is a categorical value (male & female) and you need to convert it into a dummy/indicator here is how to do it.

tranning_data = pd.read_csv("../titanic/train.csv")
features = ["Age", "Sex", ] //here sex is catagorical value
X_train = pd.get_dummies(tranning_data[features])
print(X_train)

Age Sex_female Sex_male
20    0          1
33    1          0
40    1          0
22    1          0
54    0          1

this is the exact pythonic way i was looking for! thank you! — Nidhi Agarwal, Mar 03 '21 at 03:32

score 3 · Answer 10 · answered Dec 31 '21 at 11:16

3

you can use .replace as the following:

df['col3']=df['col3'].replace(['B', 'C', 'E', 'G', 'H', 'N', 'S', 'W'],[1,2,3,4,5,6,7,8])

or .map:

df['col3']=df['col3'].map({1: 'B', 2: 'C', 3: 'E', 4:'G', 5:'H', 6:'N', 7:'S', 8:'W'})

answered Dec 31 '21 at 11:16

Mingming Qiu

333
4
9

score 3 · Answer 11 · answered Jan 06 '22 at 20:01

3

categorical_columns =['sex','class','deck','alone']

for column in categorical_columns:
     df[column] = pd.factorize(df[column])[0]

Factorize will make each unique categorical data in a column into a specific number (from 0 to infinity).

answered Jan 06 '22 at 20:01

irtexas19

55
6

score 1 · Answer 12 · edited Jul 30 '17 at 22:47

1

@Quickbeam2k1 ,see below -

dataset=pd.read_csv('Data2.csv')
np.set_printoptions(threshold=np.nan)
X = dataset.iloc[:,:].values

Using sklearn enter image description here

from sklearn.preprocessing import LabelEncoder
labelencoder_X=LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])

edited Jul 30 '17 at 22:47

gofr1

15,741
11
42
52

answered Jul 30 '17 at 22:19

Prohadoopian

29
3

3

Why didn't you just correct your previous answer? Surprisingly, you are using `fit_transform` now instead of `transform_fit`and corrected the labelencoder definition. Why do you use `iloc[:,:]`? this is useless. What is the reason behind the image? In case you wanted to prove me and @theGtknerd wrond you failed. – Quickbeam2k1 Jul 31 '17 at 05:37

score 1 · Answer 13 · answered Oct 17 '20 at 04:14

You can do it less code like below :

f = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),'col3':list('ababb')})

f['col1'] =f['col1'].astype('category').cat.codes
f['col2'] =f['col2'].astype('category').cat.codes
f['col3'] =f['col3'].astype('category').cat.codes

f

score 1 · Answer 14 · edited Jan 17 '22 at 18:33

1

Just use manual matching:

dict = {'Non-Travel':0, 'Travel_Rarely':1, 'Travel_Frequently':2}

df['BusinessTravel'] = df['BusinessTravel'].apply(lambda x: dict.get(x))

edited Jan 17 '22 at 18:33

Anas Imran Tasadduq

584
9
26

answered Jan 16 '22 at 12:03

Piotr Rząsa

11
4

SaTa · Answer 15 · 2020-03-27T20:16:28.487

0

For a certain column, if you don't care about the ordering, use this

df['col1_num'] = df['col1'].apply(lambda x: np.where(df['col1'].unique()==x)[0][0])

If you care about the ordering, specify them as a list and use this

df['col1_num'] = df['col1'].apply(lambda x: ['first', 'second', 'third'].index(x))

edited Mar 27 '20 at 20:16

answered Mar 27 '20 at 19:53

SaTa

2,422
2
14
26

score 0 · Answer 16 · answered Apr 04 '23 at 22:52

0

you can use something like this

df['Grade'].replace(['A', 'B', 'C'], [0, 1, 2], inplace=True)

use the inplace argument if so that you dont perform a copy. you select a colume and replace the distinct there with the one you want.

answered Apr 04 '23 at 22:52

Allaye

833
6
18

Convert categorical data in pandas dataframe

16 Answers16

Linked

Related