69

In pandas, how can I convert a column of a DataFrame into dtype object? Or better yet, into a factor? (For those who speak R, in Python, how do I as.factor()?)

Also, what's the difference between pandas.Factor and pandas.Categorical?

Alex Riley
N. McA.

3 Answers

96

You can use the astype method to cast a Series (one column):

df['col_name'] = df['col_name'].astype(object)

Or the entire DataFrame:

df = df.astype(object)

Update

Since version 0.15, you can use the category datatype in a Series/column:

df['col_name'] = df['col_name'].astype('category')

Note: pd.Factor has been deprecated and removed in favor of pd.Categorical.
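For readers coming from R, the following is a minimal sketch of how a category column behaves roughly like a factor. The frame and its column names (`a`, `b`) are illustrative assumptions; the `.cat` accessor exposes the levels and the integer codes.

```python
import pandas as pd

# Illustrative data; the column names are assumptions for this sketch.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": ["yes", "no", "yes", "no", "absent"]})

# Cast one column to the category dtype (pandas >= 0.15).
df["b"] = df["b"].astype("category")

# The levels (sorted unique values by default) and the integer code per row.
levels = list(df["b"].cat.categories)
codes = df["b"].cat.codes.tolist()

print(df["b"].dtype)
print(levels)
print(codes)
```

The codes index into the levels, so a code of 0 refers to the first level.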

Community
Andy Hayden
  • Thank you soo much, that was becoming a massive headache. – N. McA. Mar 30 '13 at 22:09
  • When trying this I get "TypeError: data type not understood". I tried both data['engagement'] = data['engagement'].astype(data) and data = data.astype(data). My column is: engagement 5000 non-null float64 – billmanH Apr 30 '15 at 18:19
  • You need to use object? `data['engagement'].astype(object)`... If they are already floats why would you want to change to object? – Andy Hayden Apr 30 '15 at 23:35
  • Note also that when this original answer was written, creating a categorical and then assigning it to a column converted the column to object (or another dtype), since you couldn't have categorical columns/Series until 0.15. – Andy Hayden Oct 21 '15 at 19:54
17

There's also the pd.factorize function:

# use the df data from @herrfz

In [150]: pd.factorize(df.b)
Out[150]: (array([0, 1, 0, 1, 2]), array(['yes', 'no', 'absent'], dtype=object))

In [152]: df['c'] = pd.factorize(df.b)[0]

In [153]: df
Out[153]: 
   a       b  c
0  1     yes  0
1  2      no  1
2  3     yes  0
3  4      no  1
4  5  absent  2
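As a hedged illustration of what `pd.factorize` returns, the sketch below unpacks both parts of the result on a standalone Series (the data is assumed to match the example). By default the codes follow order of first appearance; passing `sort=True` numbers the sorted unique values instead.

```python
import pandas as pd

# Standalone Series with assumed values for this sketch.
s = pd.Series(["yes", "no", "yes", "no", "absent"])

# Default: codes are assigned in order of first appearance.
codes, uniques = pd.factorize(s)

# sort=True: the unique values are sorted before codes are assigned.
sorted_codes, sorted_uniques = pd.factorize(s, sort=True)

# The original values can be recovered by indexing the uniques with the codes.
restored = uniques.take(codes)

print(codes.tolist(), list(uniques))
print(sorted_codes.tolist(), list(sorted_uniques))
```

With `sort=True` the codes agree with what a Categorical built from the same values produces by default.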
piggybox
12

Factor and Categorical are the same, as far as I know. I think it was initially called Factor, and then changed to Categorical. To convert to Categorical maybe you can use pandas.Categorical.from_array, something like this:

In [27]: df = pd.DataFrame({'a' : [1, 2, 3, 4, 5], 'b' : ['yes', 'no', 'yes', 'no', 'absent']})

In [28]: df
Out[28]: 
   a       b
0  1     yes
1  2      no
2  3     yes
3  4      no
4  5  absent

In [29]: df['c'] = pd.Categorical.from_array(df.b).labels

In [30]: df
Out[30]: 
   a       b  c
0  1     yes  2
1  2      no  1
2  3     yes  2
3  4      no  1
4  5  absent  0
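In recent pandas releases `Categorical.from_array` and `.labels` are no longer available, so the session above will not run as written. A minimal sketch of the equivalent, using the `pd.Categorical` constructor and its `.codes` attribute, might look like this; the explicit level order in the second half is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": ["yes", "no", "yes", "no", "absent"]})

# Default: categories are the sorted unique values.
cat = pd.Categorical(df["b"])
df["c"] = cat.codes

# Optional: fix the level order explicitly (this order is an assumption
# made for the example, similar to factor(x, levels=...) in R).
ordered = pd.Categorical(df["b"],
                         categories=["no", "yes", "absent"],
                         ordered=True)
df["d"] = ordered.codes

print(df)
```

Column `c` reproduces the codes shown in the output above, while `d` follows the explicitly supplied level order.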
herrfz
  • Note that the above usage has been deprecated; use `pd.Categorical(df.b).codes` instead. – Jinstrong Feb 26 '18 at 07:30