scikit-learn - vectorizing both integer and string features at the same time

Question

Is there a way to apply one-hot encoding to both string and integer features at the same time? DictVectorizer handles strings and OneHotEncoder handles integers. Is there something that combines them, treating all feature values as categorical regardless of their type?

For example, I have a pandas DataFrame in which some columns are integers and some are strings:

    >>> df
       a  b  c  d
    0  2  0  w  K
    1  0  1  f  K
    2  1  2  y  L
    3  0  0  f  M

All columns are actually categorical; the fact that some of them hold integers carries no numeric meaning. Now if I use a DictVectorizer like this:

from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer(sparse=False)
df_dict = df.to_dict(orient='records')  # one {column: value} dict per row
vectorizer.fit_transform(df_dict)

I get a nice big matrix for columns 'c' and 'd', but the values in 'a' and 'b' stay exactly the same. I need them to get the same treatment. One option is of course to apply str to 'a' and 'b', but that is both implicit (the original data is always integers) and inefficient (it iterates over entire columns, which might be quite big, just to perform a wasteful conversion).
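
For reference, the string-conversion workaround mentioned above might look roughly like the following. This is only a sketch on a toy frame rebuilt from the example; the variable names `records` and `encoded` are illustrative, and it assumes casting the whole frame with `astype(str)` is acceptable for data of this size.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Toy frame mirroring the example in the question.
df = pd.DataFrame({
    "a": [2, 0, 1, 0],
    "b": [0, 1, 2, 0],
    "c": ["w", "f", "y", "f"],
    "d": ["K", "K", "L", "M"],
})

# Cast every column to str so DictVectorizer treats all values as categories.
records = df.astype(str).to_dict(orient="records")

vectorizer = DictVectorizer(sparse=False)
encoded = vectorizer.fit_transform(records)

print(vectorizer.feature_names_)  # names take the form "column=value"
print(encoded.shape)
```

Each distinct value in each column becomes its own indicator feature, so the integer columns are no longer passed through as numbers.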

Is there a simple way of doing this?

Thanks

You could just filter the columns on dtype: http://stackoverflow.com/questions/22697773/how-to-check-the-dtype-of-a-column-in-python-pandas — EdChum, Mar 10 '15 at 16:59
Thanks EdChum. It's a solution, but not so straightforward.. I'm looking for some kind of a general purpose "flattener", because that's actually what I'm doing. If I'd have no choice I'd write one myself. — nivniv, Mar 10 '15 at 17:05

score 0 · Answer 1 · answered Mar 10 '15 at 17:32

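Looks like pandas' get_dummies is what you want. It converts categorical columns into one-hot indicator columns. By default it only encodes object, string, and category columns, but integer columns can be included by naming them in the columns argument.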

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.get_dummies.html
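
As a minimal sketch of how this could apply to the frame in the question (the DataFrame below is rebuilt by hand from that example, and the names `default` and `dummies` are illustrative): passing every column name through `columns` asks `get_dummies` to encode the integer columns as well.

```python
import pandas as pd

# Toy frame mirroring the example in the question.
df = pd.DataFrame({
    "a": [2, 0, 1, 0],
    "b": [0, 1, 2, 0],
    "c": ["w", "f", "y", "f"],
    "d": ["K", "K", "L", "M"],
})

# Default behaviour: only the string columns are encoded,
# while the integer columns 'a' and 'b' pass through unchanged.
default = pd.get_dummies(df)

# Listing every column explicitly encodes the integer columns too.
dummies = pd.get_dummies(df, columns=list(df.columns))

print(list(default.columns))
print(list(dummies.columns))
```

The resulting indicator columns are named `<column>_<value>`, for example `a_2` or `c_w`.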

cwharland
