Pandas cast all object columns to category

Question

I want an elegant function to cast all object columns in a pandas DataFrame to categories.

df[x] = df[x].astype("category") performs the type cast for a single column, and df.select_dtypes(include=['object']) sub-selects all object columns. However, the selection drops the other columns, so a manual merge is required afterwards. Is there a solution that "just works in place" or does not require a manual cast?

edit

I am looking for something similar to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html, but for conversion to categorical data.

piRSquared · Accepted Answer · 2016-10-07T05:39:25.750

11

Use apply and pd.Series.astype with dtype='category'.

Consider the pd.DataFrame df

import pandas as pd

df = pd.DataFrame(dict(
        A=[1, 2, 3, 4],
        B=list('abcd'),
        C=[2, 3, 4, 5],
        D=list('defg')
    ))
df

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A    4 non-null int64
B    4 non-null object
C    4 non-null int64
D    4 non-null object
dtypes: int64(2), object(2)
memory usage: 200.0+ bytes

Let's use select_dtypes to pick out all 'object' columns, convert them, and recombine them with the columns returned by a second select_dtypes call that excludes 'object'.

df = pd.concat([
        df.select_dtypes([], ['object']),
        df.select_dtypes(['object']).apply(pd.Series.astype, dtype='category')
        ], axis=1).reindex_axis(df.columns, axis=1)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A    4 non-null int64
B    4 non-null category
C    4 non-null int64
D    4 non-null category
dtypes: category(2), int64(2)
memory usage: 208.0 bytes
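
The same conversion can also be written without concat. The following is a minimal sketch, not part of the original answer: it assumes DataFrame.astype is given a column-to-dtype mapping, and to_category is a hypothetical helper name. The sample columns are created with an explicit object dtype so the example does not depend on how a given pandas version stores strings by default.

```python
import pandas as pd


def to_category(frame):
    """Return a copy of frame with every object-dtype column cast to category.

    Hypothetical helper for illustration; the name is not a pandas API.
    """
    object_cols = frame.select_dtypes(include=['object']).columns
    # astype accepts a {column: dtype} mapping; columns that are not
    # listed keep their current dtype and their position.
    return frame.astype({col: 'category' for col in object_cols})


df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.Series(list('abcd'), dtype=object),
    'C': [2, 3, 4, 5],
    'D': pd.Series(list('defg'), dtype=object),
})

converted = to_category(df)
print(converted.dtypes)
```

Because astype returns a new DataFrame, the column order is preserved and no reindexing step is needed; reassign the result (df = to_category(df)) to replace the original.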

edited Oct 07 '16 at 05:39

answered Oct 06 '16 at 22:21


Indeed this is a great start. But I only want to convert object dtype and not float or integer as your solution "brute-forcely" converts anything to category – Georg Heiler Oct 07 '16 at 04:34
This: df.select_dtypes(include=['object']).apply(pd.Series.astype, dtype='category').info() partially works, i.e. all object columns are converted. But afterwards a manual merge with the numeric columns is needed. How can I avoid this and selectively change the dtypes in place? – Georg Heiler Oct 07 '16 at 04:40
Is there maybe a more efficient way? – Benni Dec 02 '20 at 09:58

score 7 · Answer 2 · answered Jul 09 '19 at 03:35

I think that this is a more elegant way:

import pandas as pd

df = pd.DataFrame(dict(
        A=[1, 2, 3, 4],
        B=list('abcd'),
        C=[2, 3, 4, 5],
        D=list('defg')
    ))

df.info()

df.loc[:, df.dtypes == 'object'] =\
    df.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))

df.info()
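
On recent pandas releases, assigning through df.loc[:, mask] may write the new values into the existing columns and leave their dtype as object, so it is worth checking the df.info() output after running the snippet above. A minimal sketch of a variant that assigns by column label instead, which replaces the columns rather than filling them (the name object_cols is an assumption for illustration, and the sample columns are given an explicit object dtype):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.Series(list('abcd'), dtype=object),
    'C': [2, 3, 4, 5],
    'D': pd.Series(list('defg'), dtype=object),
})

# Labels of the columns that currently have object dtype
object_cols = df.select_dtypes(include=['object']).columns

# df[labels] = ... swaps in the new columns, so the category dtype is kept
df[object_cols] = df[object_cols].astype('category')

print(df.dtypes)
```

The numeric columns are untouched and the column order stays the same, so no merge or reindexing is required.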

score 2 · Answer 3 · answered Oct 28 '19 at 18:30

Wish I could add this as a comment, but can't.

The accepted answer doesn't work for pandas version 0.25 and higher. Use .reindex instead of reindex_axis. See here for more information: https://github.com/scikit-hep/root_pandas/issues/82

a Data Head

score 0 · Answer 4 · answered Mar 07 '19 at 09:33

Often the order of categories has meaning; for example, t-shirt sizes 'S', 'M', 'L', 'XL' are ordered categories (ordinals in SPSS). If you are interested in creating ordered categories from strings, you can use this code:

df = pd.concat([
        df.select_dtypes([], ['object']),
        df.select_dtypes(['object']).apply(pd.Categorical, ordered=True)
        ], axis=1).reindex(df.columns, axis=1)

In the resulting DataFrame, categorical columns can be sorted by value the same way you used to sort strings.
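
Note that when pd.Categorical is called with ordered=True but without explicit categories, the category order is inferred from the sorted unique values, which for strings is lexical ('L' < 'M' < 'S' < 'XL'). If the order should follow the meaning of the labels, it has to be spelled out. A minimal sketch, assuming a single hypothetical column named size and a hand-written size order:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical example data; the column name 'size' is an assumption
sizes = pd.DataFrame({'size': pd.Series(['M', 'XL', 'S', 'L'], dtype=object)})

# Spell out the intended order from smallest to largest
size_dtype = CategoricalDtype(categories=['S', 'M', 'L', 'XL'], ordered=True)
sizes['size'] = sizes['size'].astype(size_dtype)

# Sorting now follows the declared category order, not alphabetical order
ordered = sizes.sort_values('size')['size'].tolist()
print(ordered)  # ['S', 'M', 'L', 'XL']
```

Values that are not listed in categories become NaN after the cast, so the list should cover every label that occurs in the column.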
