I have a dataset which has a few columns with categorical data.

I've been using the Categorical function to replace categorical values with numerical ones.

data[column] = pd.Categorical(data[column]).codes

I recently ran across the pandas.get_dummies function. Are the two interchangeable? Is there an advantage to using one over the other?

sapo_cosmico
  • If you just want to convert to numeric values for sklearn, why not [DictVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html)? – EdChum Mar 24 '15 at 08:28
  • To be honest, Ed, because I didn't know it existed :) – sapo_cosmico Mar 24 '15 at 22:11
  • You'll probably find that sklearn covers most of your data preprocessing needs – EdChum Mar 24 '15 at 22:13

1 Answer

Why are you converting the categorical data to integers? I don't believe you save memory, if that is your goal.

import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})
df2 = pd.DataFrame({'cat': [1, 1, 1, 2, 2, 3]})

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat    6 non-null category
dtypes: category(1)
memory usage: 78.0 bytes

>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 1 columns):
cat    6 non-null int64
dtypes: int64(1)
memory usage: 96.0 bytes

The categorical codes are just integer values for the unique items in the given category. By contrast, get_dummies returns a new column for each unique item. The value in the column indicates whether or not the record has that attribute.

>>> pd.get_dummies(df)
   cat_a  cat_b  cat_c
0      1      0      0
1      1      0      0
2      1      0      0
3      0      1      0
4      0      1      0
5      0      0      1

To get the codes directly, you can use:

df['codes'] = df['cat'].cat.codes
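
As a minimal sketch of the difference between the two encodings, assuming a recent pandas version: the column name `cat` follows the example above, and `dtype=int` is passed only so that the dummy columns hold 0/1 rather than booleans.

```python
import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical(['a', 'a', 'a', 'b', 'b', 'c'])})

# Integer codes: a single column, each value is the position of the
# label in df['cat'].cat.categories
codes = df['cat'].cat.codes

# One-hot encoding: one indicator column per category
dummies = pd.get_dummies(df['cat'], prefix='cat', dtype=int)

# The codes can be mapped back to the original labels via the categories index
restored = df['cat'].cat.categories.take(codes.to_numpy())

print(codes.tolist())
print(list(dummies.columns))
print(list(restored))
```

The codes keep the data in one column but impose an arbitrary integer order on the labels, while the dummies avoid that order at the cost of one column per category.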
Alexander
  • Thanks Alexander, I'm actually preparing the dataset for a Random Forest regression, so I need everything to be numerical. It actually turns out that get_dummies will give me memory errors, whereas Categorical will not – sapo_cosmico Mar 24 '15 at 00:25
  • This is not an answer to the second part of the question, which was the key part, I guess: "I've recently ran across the pandas.get_dummies function. Are these interchangeable? Is there an advantage of using one over the other?" – Geeocode Nov 03 '15 at 22:29
  • The second part of the question isn't a programming question. A machine learning algorithm will interpret categorical data in `df2` as having order (e.g. green is greater than red). Whether or not this is desirable depends on your use case. To get around this issue, dummy variables (aka one-hot encoding) create a new feature for each of the categorical items. – Alexander Nov 06 '15 at 17:39
  • @sapo_cosmico In reference to the memory error, you can use the `sparse=True` option, which shouldn't use much more memory than the original categorical dataframe. – T.C. Proctor Jun 05 '16 at 19:24
  • @Alexander As for the machine learning question, it seems that random forest is generally able to create trees that can ignore the implied order, though I haven't seen any rigorous proofs of it. It seems to be common practice, though. – T.C. Proctor Jun 05 '16 at 19:25
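
A rough sketch of the `sparse=True` suggestion from the comments, assuming a recent pandas version. The data is made up (a hypothetical column with 200 distinct labels) purely to make the size difference visible; actual savings depend on the number of rows and categories.

```python
import pandas as pd

# Hypothetical high-cardinality column: 10,000 rows, 200 distinct labels
s = pd.Series([f"item_{i % 200}" for i in range(10_000)], dtype="category")

# Dense dummies store every cell, including all the zeros
dense = pd.get_dummies(s)

# Sparse dummies store only the non-zero entries of each column
sparse = pd.get_dummies(s, sparse=True)

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
print(dense.shape, dense_bytes, sparse_bytes)
```

Whether a given estimator accepts sparse columns directly is a separate question and is not checked here.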