
Given a pandas DataFrame that looks like this:

|       | c_0337 | c_0348 | c_0351 | c_0364 |
|-------|:------:|-------:|--------|--------|
| id    |        |        |        |        |
| 11193 |    a   |      f | o      | a      |
| 11382 |    a   |      k | s      | a      |
| 16531 |    b   |      p | f      | b      |
| 1896  |    a   |      f | o      | NaN    |

I am trying to convert the categorical variables to numeric (preferably binary true/false columns). I tried using the OneHotEncoder from scikit-learn as follows:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit(c4k.ix[:, 'c_0327':'c_0351'].values)
# the interpreter echoes the fitted estimator:
# OneHotEncoder(categorical_features='all', n_values='auto', sparse=True)

That just gave me: `invalid literal for long() with base 10: 'f'`
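(Presumably the error comes from OneHotEncoder expecting numeric input in scikit-learn 0.x. A minimal sketch of one workaround, using a small hypothetical array in place of the real frame: map each column's strings to integer codes with LabelEncoder first, then one-hot encode the integers.)

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample mirroring the question's data
X = np.array([['a', 'f'],
              ['a', 'k'],
              ['b', 'p'],
              ['a', 'f']])

# OneHotEncoder chokes on strings here, so first map each column's
# letters to integer codes with LabelEncoder
X_int = np.column_stack([LabelEncoder().fit_transform(col) for col in X.T])

enc = OneHotEncoder()
X_onehot = enc.fit_transform(X_int)  # sparse 0/1 indicator matrix
print(X_onehot.toarray())
```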

I need to get the data into an array acceptable to scikit-learn, with a column created per letter that is false for most entries (i.e. very sparse) and true where the row contains the corresponding letter, with NaN being 0 = false.

I suspect I'm way off here? Like not even using the right preprocessor?

Brand new at this, so any pointers appreciated; the actual dataset has over 1,000 such columns. So then I tried using DictVectorizer as follows:

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer() 
# fill the DataFrame with zeros since we don't want NaN
c4kNZ = c4k.ix[:, 'c_0327':'c_0351'].fillna(0)
# make the DataFrame a dict
c4kb = c4kNZ.to_dict()
sdata = vec.fit_transform(c4kb) 

It gives me `float() argument must be a string or a number`. I rechecked the dict and it looks OK to me, but I guess I have not gotten it formatted correctly?
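(Likely the problem is the dict layout: `DataFrame.to_dict()` defaults to `{column: {index: value}}`, while DictVectorizer wants one dict per row. A minimal sketch with a hypothetical two-column frame, using `to_dict(orient='records')`:)

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical frame mirroring the question's data
df = pd.DataFrame({'c_0337': ['a', 'a', 'b', 'a'],
                   'c_0348': ['f', 'k', 'p', 'f']})

# One dict per row, e.g. {'c_0337': 'a', 'c_0348': 'f'}, which is
# the shape DictVectorizer expects
records = df.to_dict(orient='records')

vec = DictVectorizer()
sdata = vec.fit_transform(records)  # sparse 0/1 indicator matrix
print(vec.feature_names_)           # e.g. ['c_0337=a', 'c_0337=b', ...]
```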

dartdog
  • You could use [DictVectoriser](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) – EdChum Mar 26 '15 at 20:21
  • That seemed like the long way around, but maybe that is it... – dartdog Mar 26 '15 at 20:22
  • Could you explain your data bit more? Is it correct that you have 1000+ features, each of which may or may not be binary? If that's true, why do you want to force sparsity? – AGS Mar 26 '15 at 22:47
  • Don't want to force sparsity, it is sparse as I understand the term...?? e.g. most values will be null/0 – dartdog Mar 26 '15 at 22:58
  • Because you are building a model where each feature is forced to be binary, I guess my question is why must this be so in your example? I.e., why can't feature `c_0348` map to `[0,1,2]` vs `c_0348f=[1,0]`, `c_0348k=[1,0]`, `c_0348p=[1,0]`? – AGS Mar 26 '15 at 23:45

1 Answer


Is this what you are looking for?
It uses get_dummies to convert categorical columns into sparse dummy columns indicating the presence of each value:

In [12]: df = pd.DataFrame({'c_0337':list('aaba'), 'c_0348':list('fkpf')})

In [13]: df
Out[13]:
  c_0337 c_0348
0      a      f
1      a      k
2      b      p
3      a      f

In [14]: pd.get_dummies(df)
Out[14]:
   c_0337_a  c_0337_b  c_0348_f  c_0348_k  c_0348_p
0         1         0         1         0         0
1         1         0         0         1         0
2         0         1         0         0         1
3         1         0         1         0         0
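(As a side note on the question's NaN requirement: by default a row with a missing value simply gets 0 in every dummy column, which matches "NaN being 0 = false"; passing `dummy_na=True` would add an explicit NaN indicator column instead. A small sketch with a hypothetical single column like the question's `c_0364`:)

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value, like c_0364 in the question
df = pd.DataFrame({'c_0364': ['a', 'a', 'b', np.nan]})

# Default: the NaN row gets 0 in every dummy column
dummies = pd.get_dummies(df)
print(dummies)

# dummy_na=True adds an explicit indicator column for NaN
dummies_na = pd.get_dummies(df, dummy_na=True)
print(dummies_na)
```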
joris
  • The question is maybe how to combine the output of the different columns. There are some options for this, or you can also do it separately for each column and then combine the dataframes as you want. – joris Mar 26 '15 at 22:19
  • does exactly what I want.. Thanks! – dartdog Mar 27 '15 at 13:14