pandas.get_dummies returns two columns (_Y and _N) instead of one

Question

I am trying to use sklearn to train a decision tree based on my dataset.

When I tried to slice the data into the outcome (Y) and the predictor variables (X), it turned out that the outcome (my label) is in True/False:

# data slicing
X = df.values[:, 3:27]  # X holds the predictor variables; unique_id and student name are dropped here
Y = df['OffTask'].values  # Y is the predicted value (outcome); it is in the 3rd column

This is what I did, but I do not know whether it is the right approach:

#convert the label "OffTask" to dummy 

df1 = pd.get_dummies(df,columns=["OffTask"])
df1

My trouble is that in the resulting dataset df1, my label OffTask is split into two columns, OffTask_N and OffTask_Y.

Does anyone know how to fix it?

Possible duplicate of [How can I map True/False to 1/0 in a Pandas DataFrame?](https://stackoverflow.com/questions/17383094/how-can-i-map-true-false-to-1-0-in-a-pandas-dataframe) — mkrieger1, Feb 11 '19 at 21:48
I don't think it is the same question. I am not sure how to convert the list and reuse the list. — WY G, Feb 11 '19 at 21:51
sklearn can take True/False as a y vector and do the fitting just fine, there really is no need for you to convert. But if you really insist on seeing 0 and 1 you can use `df['OffTask'] = df['OffTask'].astype(int)` — Tacratis, Feb 11 '19 at 22:14
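A minimal sketch of the conversion suggested in the last comment, using made-up toy data (only the column name OffTask comes from the question). The OffTask_Y / OffTask_N column names suggest the column may hold the strings 'Y' and 'N' rather than booleans; that is an assumption here. In that case `astype(int)` cannot be applied directly, so the values are compared to 'Y' first:

```python
import pandas as pd

# Hypothetical toy data; only the column name OffTask comes from the question.
df = pd.DataFrame({"OffTask": ["Y", "N", "N", "Y"]})

# Strings cannot be cast with astype(int) directly, so compare to "Y" first.
# For a column that already holds True/False, astype(int) alone is enough.
df["OffTask"] = (df["OffTask"] == "Y").astype(int)

labels = df["OffTask"].tolist()
print(labels)
```

This keeps the label in a single column, so nothing has to be dropped afterwards.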

score 1 · Answer 1 · answered Feb 12 '19 at 09:10

get_dummies converts nominal (categorical) string values into integer indicator columns. It returns one column for each unique string value in the encoded column, e.g.:

import pandas as pd

# one indicator column is created per unique value of 'color'
df = {'color': ['red', 'green', 'blue'], 'price': [1200, 3000, 2500]}
my_df = pd.DataFrame(df)
pd.get_dummies(my_df)

In your case you can drop the first column. Wherever the remaining column is 0, the row can be taken to have the first value.
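A short sketch of dropping the first column, assuming toy data in which OffTask holds 'N' and 'Y' (the `time_on_task` column is made up for illustration). It uses the `drop_first` parameter of `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical toy data; column names other than OffTask are made up.
df = pd.DataFrame({
    "time_on_task": [12.5, 3.0, 8.2, 1.1],
    "OffTask": ["N", "Y", "N", "Y"],
})

# drop_first=True drops the first category in sorted order ("N"),
# leaving a single indicator column named OffTask_Y.
df1 = pd.get_dummies(df, columns=["OffTask"], drop_first=True)

# Cast to int so the indicator is 0/1 regardless of pandas version
# (newer versions return booleans by default).
df1["OffTask_Y"] = df1["OffTask_Y"].astype(int)

print(df1.columns.tolist())
print(df1["OffTask_Y"].tolist())
```

A 0 in OffTask_Y then stands for the dropped category 'N'.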

Pradeep Pandey


Hi, thanks. This is what I did: I just dropped the _N column in this case, but I am wondering whether there is a better way to do it – WY G Feb 12 '19 at 15:10

score 0 · Answer 2 · answered Feb 12 '19 at 07:21

You can make pd.get_dummies return only one column by setting drop_first=True:

# returns the full DataFrame with a single OffTask_Y column; OffTask_N is dropped
df1 = pd.get_dummies(df, columns=["OffTask"], drop_first=True)

But this is not the recommended way to convert a label to binary values. I would suggest using LabelBinarizer for this purpose.

Example:

import pandas as pd
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
# pass the label as a 1-D column
lb.fit_transform(pd.DataFrame({'OffTask': ['yes', 'no', 'no', 'yes']})['OffTask'])

# Output:
array([[1],
       [0],
       [0],
       [1]])
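To get the binarized label back into the dataset, one option is to flatten the result and assign it to the original column. The sketch below assumes toy data (the `feature_a` column and its values are made up) and then fits a decision tree on it, since that is the stated goal of the question:

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; only the column name OffTask comes from the question.
df = pd.DataFrame({
    "feature_a": [0.1, 0.9, 0.8, 0.2],
    "OffTask": ["Y", "N", "N", "Y"],
})

lb = LabelBinarizer()
# fit_transform returns shape (n_samples, 1) for a binary label;
# ravel() flattens it so it can be stored as a normal column.
df["OffTask"] = lb.fit_transform(df["OffTask"]).ravel()

X = df[["feature_a"]]
y = df["OffTask"]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

print(lb.classes_)  # the original labels, in the order used for encoding
print(df["OffTask"].tolist())
```

`lb.classes_` records which original label maps to 0 and which to 1, and `lb.inverse_transform` can map predictions back to the original labels if needed.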

Venkatachalam


Hi, thank you for the reply. I am still a little confused about how the preprocessing converts the list to binary in this case. How can the result be put back into my dataset? – WY G Feb 12 '19 at 15:07
It will create a dummy variable for each unique value in a `list` / `pd.Series`. Each dummy variable is 1 if the corresponding element equals that value. – Venkatachalam Feb 13 '19 at 06:59
Go through this link for a detailed explanation: https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets – Venkatachalam Feb 13 '19 at 07:01
