155

I have a dataframe with this type of data (too many columns):

col1        int64
col2        int64
col3        category
col4        category
col5        category

Columns look like this:

Name: col3, dtype: category
Categories (8, object): [B, C, E, G, H, N, S, W]

I want to convert all the values in each column to integer like this:

[1, 2, 3, 4, 5, 6, 7, 8]

I solved this for one column by this:

dataframe['c'] = pandas.Categorical.from_array(dataframe.col3).codes

Now I have two columns in my dataframe - old col3 and new c and need to drop old columns.

That's bad practice. It works but in my dataframe there are too many columns and I don't want do it manually.

How can I do this more cleverly?

Gilaztdinov Rustam
  • 2,281
  • 5
  • 18
  • 22

16 Answers16

231

First, to convert a Categorical column to its numerical codes, you can do this easier with: dataframe['c'].cat.codes.
Further, it is possible to select automatically all columns with a certain dtype in a dataframe using select_dtypes. This way, you can apply above operation on multiple and automatically selected columns.

First making an example dataframe:

In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),  'col3':list('ababb')})

In [76]: df['col2'] = df['col2'].astype('category')

In [77]: df['col3'] = df['col3'].astype('category')

In [78]: df.dtypes
Out[78]:
col1       int64
col2    category
col3    category
dtype: object

Then by using select_dtypes to select the columns, and then applying .cat.codes on each of these columns, you can get the following result:

In [80]: cat_columns = df.select_dtypes(['category']).columns

In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')

In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

In [84]: df
Out[84]:
   col1  col2  col3
0     1     0     0
1     2     1     1
2     3     2     0
3     4     0     1
4     5     1     1
joris
  • 133,120
  • 36
  • 247
  • 202
  • 20
    is there a easy way we get a mapping between category code and category string values? – Allan Ruin Jul 28 '16 at 08:51
  • 7
    You can use: `df['col2'].cat.categories` for instance. – ogrisel Oct 08 '16 at 13:56
  • 15
    Pointing out for anyone concerned that this will map `NaN`'s uniquely to `-1` – crackpotHouseplant Apr 07 '17 at 23:44
  • 2
    Love the 2 liners ;) – Jose A Jul 18 '18 at 14:09
  • 2
    Watch out that if the categorical is ordered (an ordinal) then the numerical codes returned by `cat.codes` may NOT be the ones you see in the Series ! – paulperry Feb 07 '19 at 22:09
  • In this example `col1` and `col2` will be the same and this may NOT be what you want: ```df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':[5,4,3,2,1], 'col3':list('ababb')}) df.col1 = df.col1.astype('category', ordered=False) df.col2 = df.col2.astype(pd.api.types.CategoricalDtype(categories=[5,4,3,2,1], ordered=True))``` – paulperry Feb 07 '19 at 22:26
  • The code in my answer works exactly the same for ordered or non-ordered categoricals. – joris Feb 08 '19 at 07:23
  • I tried to convert directly one column with _df[''col2'] = _df['col2'].apply(lambda x: x.cat.codes) but I get the error `AttributeError: 'str' object has no attribute 'astype' ` even though the column is a categorical type. – Zioalex Jul 15 '20 at 16:11
82

This works for me:

pandas.factorize( ['B', 'C', 'D', 'B'] )[0]

Output:

[0, 1, 2, 0]
scottlittle
  • 18,866
  • 8
  • 51
  • 70
  • 5
    underrated answer – aryanknp Jan 07 '21 at 13:04
  • 7
    great, much simpler than the accepted answer – Moritz Jan 19 '21 at 18:12
  • 4
    I agree, this is a very good and efficient answer – Laurent Apr 09 '21 at 07:34
  • 4
    best answer, imho – Lécio Bourbon Dec 22 '21 at 12:48
  • 1
    While this solves the problem, you should prefer the accessor `pd.Series.cat.codes` over `pd.factorize` for performance reasons. Internally, a categorical is already a list of indices and extracting that via `.cat.codes` takes `O(0)` time, whereas (re-)factorizing via `pd.factorize` takes `O(n)` time ([currently](https://github.com/pandas-dev/pandas/blob/5d17d73be969a7d004a7e1035afbf811a8fbf18a/pandas/core/algorithms.py#L634-L812) no fast-path for categorical). – FirefoxMetzger Feb 08 '23 at 09:26
  • Then, nowadays what should we use, pd.factorize(), astype('category') or pd.Categorical()? – skan Jul 26 '23 at 11:56
26

If your concern was only that you making a extra column and deleting it later, just dun use a new column at the first place.

dataframe = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),  'col3':list('ababb')})
dataframe.col3 = pd.Categorical.from_array(dataframe.col3).codes

You are done. Now as Categorical.from_array is deprecated, use Categorical directly

dataframe.col3 = pd.Categorical(dataframe.col3).codes

If you also need the mapping back from index to label, there is even better way for the same

dataframe.col3, mapping_index = pd.Series(dataframe.col3).factorize()

check below

print(dataframe)
print(mapping_index.get_loc("c"))
Abhishek
  • 3,337
  • 4
  • 32
  • 51
17

Here multiple columns need to be converted. So, one approach i used is ..

for col_name in df.columns:
    if(df[col_name].dtype == 'object'):
        df[col_name]= df[col_name].astype('category')
        df[col_name] = df[col_name].cat.codes

This converts all string / object type columns to categorical. Then applies codes to each type of category.

shantanu pathak
  • 2,018
  • 19
  • 26
16

What I do is, I replace values.

Like this-

df['col'].replace(to_replace=['category_1', 'category_2', 'category_3'], value=[1, 2, 3], inplace=True)

In this way, if the col column has categorical values, they get replaced by the numerical values.

truth
  • 329
  • 3
  • 14
8

For converting categorical data in column C of dataset data, we need to do the following:

from sklearn.preprocessing import LabelEncoder 
labelencoder= LabelEncoder() #initializing an object of class LabelEncoder
data['C'] = labelencoder.fit_transform(data['C']) #fitting and transforming the desired categorical column.
8

To convert all the columns in the Dataframe to numerical data:

df2 = df2.apply(lambda x: pd.factorize(x)[0])
sauravjoshi23
  • 837
  • 11
  • 9
5

Answers here seem outdated. Pandas now has a factorize() function and you can create categories as:

df.col.factorize() 

Function signature:

pandas.factorize(values, sort=False, na_sentinel=- 1, size_hint=None)
Hamza
  • 5,373
  • 3
  • 28
  • 43
3

One of the simplest ways to convert the categorical variable into dummy/indicator variables is to use get_dummies provided by pandas. Say for example we have data in which sex is a categorical value (male & female) and you need to convert it into a dummy/indicator here is how to do it.

tranning_data = pd.read_csv("../titanic/train.csv")
features = ["Age", "Sex", ] //here sex is catagorical value
X_train = pd.get_dummies(tranning_data[features])
print(X_train)

Age Sex_female Sex_male
20    0          1
33    1          0
40    1          0
22    1          0
54    0          1
Hadi Mir
  • 4,497
  • 2
  • 29
  • 31
3

you can use .replace as the following:

df['col3']=df['col3'].replace(['B', 'C', 'E', 'G', 'H', 'N', 'S', 'W'],[1,2,3,4,5,6,7,8])

or .map:

df['col3']=df['col3'].map({1: 'B', 2: 'C', 3: 'E', 4:'G', 5:'H', 6:'N', 7:'S', 8:'W'})
Mingming Qiu
  • 333
  • 4
  • 9
3
categorical_columns =['sex','class','deck','alone']

for column in categorical_columns:
     df[column] = pd.factorize(df[column])[0]

Factorize will make each unique categorical data in a column into a specific number (from 0 to infinity).

irtexas19
  • 55
  • 6
1

@Quickbeam2k1 ,see below -

dataset=pd.read_csv('Data2.csv')
np.set_printoptions(threshold=np.nan)
X = dataset.iloc[:,:].values

Using sklearn enter image description here

from sklearn.preprocessing import LabelEncoder
labelencoder_X=LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
gofr1
  • 15,741
  • 11
  • 42
  • 52
  • 3
    Why didn't you just correct your previous answer? Surprisingly, you are using `fit_transform` now instead of `transform_fit`and corrected the labelencoder definition. Why do you use `iloc[:,:]`? this is useless. What is the reason behind the image? In case you wanted to prove me and @theGtknerd wrond you failed. – Quickbeam2k1 Jul 31 '17 at 05:37
1

You can do it less code like below :

f = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),'col3':list('ababb')})

f['col1'] =f['col1'].astype('category').cat.codes
f['col2'] =f['col2'].astype('category').cat.codes
f['col3'] =f['col3'].astype('category').cat.codes

f

enter image description here

biddut
  • 353
  • 3
  • 6
1

Just use manual matching:

dict = {'Non-Travel':0, 'Travel_Rarely':1, 'Travel_Frequently':2}

df['BusinessTravel'] = df['BusinessTravel'].apply(lambda x: dict.get(x))
0

For a certain column, if you don't care about the ordering, use this

df['col1_num'] = df['col1'].apply(lambda x: np.where(df['col1'].unique()==x)[0][0])

If you care about the ordering, specify them as a list and use this

df['col1_num'] = df['col1'].apply(lambda x: ['first', 'second', 'third'].index(x))
SaTa
  • 2,422
  • 2
  • 14
  • 26
0

you can use something like this

df['Grade'].replace(['A', 'B', 'C'], [0, 1, 2], inplace=True)

use the inplace argument if so that you dont perform a copy. you select a colume and replace the distinct there with the one you want.

Allaye
  • 833
  • 6
  • 18