4

I am trying to use sklearn to train a decision tree based on my dataset.

When I tried to slice the data into the outcome (Y) and the predictor variables (X), it turned out that the outcome (my label) is in True/False:

# data slicing
X = df.values[:, 3:27]  # X holds the predictor variables; unique_id and student name are dropped here
Y = df['OffTask'].values  # Y is the predicted value (outcome); it is in the 3rd column

This is what I did, but I do not know whether it is the right approach:

#convert the label "OffTask" to dummy 

df1 = pd.get_dummies(df,columns=["OffTask"])
df1

My problem is that in the resulting dataset df1, my label OffTask has been split into two columns, OffTask_N and OffTask_Y.

Does anyone know how to fix it?

Venkatachalam
WY G
  • Is this about pandas? – mkrieger1 Feb 11 '19 at 21:47
  • Possible duplicate of [How can I map True/False to 1/0 in a Pandas DataFrame?](https://stackoverflow.com/questions/17383094/how-can-i-map-true-false-to-1-0-in-a-pandas-dataframe) – mkrieger1 Feb 11 '19 at 21:48
  • I don't think it is the same question. I am not sure how to convert the list and reuse the list. – WY G Feb 11 '19 at 21:51
  • sklearn can take True/False as a y vector and do the fitting just fine, there really is no need for you to convert. But if you really insist on seeing 0 and 1 you can use `df['OffTask'] = df['OffTask'].astype(int)` – Tacratis Feb 11 '19 at 22:14
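
To illustrate the last comment, here is a minimal sketch on a made-up toy DataFrame. The feature columns `f1` and `f2` and all values are hypothetical; only the column name `OffTask` comes from the question. It fits a `DecisionTreeClassifier` on a boolean label directly and then shows the optional `astype(int)` conversion.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; only the column name "OffTask" comes from the question.
df = pd.DataFrame({
    "f1": [0.1, 0.9, 0.2, 0.8],
    "f2": [1, 0, 1, 0],
    "OffTask": [True, False, True, False],
})

X = df[["f1", "f2"]].values
y = df["OffTask"].values  # boolean labels, used without any conversion

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = clf.predict(X)     # predictions come back as booleans too

# Optional: store the label as 0/1 instead of False/True
df["OffTask"] = df["OffTask"].astype(int)
```

Whether to convert is mostly a matter of how you want to read the output; the fitted tree should be the same either way.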

2 Answers

1

get_dummies converts nominal (categorical) string values into 0/1 indicator columns. It returns one column for each unique string value in the encoded column, e.g.:

import pandas as pd

df = {'color': ['red', 'green', 'blue'], 'price': [1200, 3000, 2500]}
my_df = pd.DataFrame(df)
pd.get_dummies(my_df)  # columns: price, color_blue, color_green, color_red

In your case you can drop the first column: any row where the remaining column is 0 is then taken to have the first value (note that null values also end up as 0).
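
A small sketch of that idea, assuming the label holds the strings `Y` and `N` (suggested by the `OffTask_N` / `OffTask_Y` column names in the question; the rows themselves are made up):

```python
import pandas as pd

# Made-up rows; the Y/N values are an assumption based on the question's column names.
df = pd.DataFrame({"OffTask": ["Y", "N", "N", "Y"]})

dummies = pd.get_dummies(df, columns=["OffTask"])
print(list(dummies.columns))   # ['OffTask_N', 'OffTask_Y']

# The two columns are complements, so one of them carries all the information.
y = dummies["OffTask_Y"].astype(int)
print(y.tolist())              # [1, 0, 0, 1]
```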

Pradeep Pandey
  • Hi, thanks. This is what I did. I just drop the _N in this case, but I am just wondering whether there is a better way to do that – WY G Feb 12 '19 at 15:10
0

You could make pd.get_dummies return only one column for the label by setting drop_first=True:

df1 = pd.get_dummies(df, columns=["OffTask"], drop_first=True)
y = df1["OffTask_Y"]  # the only OffTask column left after drop_first

But this is not the recommended way to convert a label to binary values. I would suggest using LabelBinarizer for this purpose.

Example:

import pandas as pd
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
# pass the label column (a Series) rather than the whole DataFrame
lb.fit_transform(pd.DataFrame({'OffTask': ['yes', 'no', 'no', 'yes']})['OffTask'])

# Output:
array([[1],
       [0],
       [0],
       [1]])
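
To put the encoded label back into a DataFrame, one option is to flatten the result with `ravel()` and assign it to a column. A minimal sketch, assuming a toy frame with made-up values:

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Made-up data; only the column name "OffTask" comes from the question.
df = pd.DataFrame({"OffTask": ["yes", "no", "no", "yes"]})

lb = LabelBinarizer()
encoded = lb.fit_transform(df["OffTask"])   # 2-D array of shape (4, 1)

# Flatten to 1-D and overwrite the original column
df["OffTask"] = encoded.ravel()

print(lb.classes_)             # ['no' 'yes'], so 'no' -> 0 and 'yes' -> 1
print(df["OffTask"].tolist())  # [1, 0, 0, 1]
```

The fitted `lb` keeps the class order in `lb.classes_`, so the original strings can be recovered later with `lb.inverse_transform`.
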
Venkatachalam
  • Hi, thank you for the reply. I am still a little confused about how the preprocessing has converted the list to binary in this case. How could it be returned to my dataset? – WY G Feb 12 '19 at 15:07
  • It will create a dummy variable for each unique value in a `list` / `pd.Series`. A dummy variable is then 1 if the corresponding element belongs to that value. – Venkatachalam Feb 13 '19 at 06:59
  • Go through this link for a detailed explanation: https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets – Venkatachalam Feb 13 '19 at 07:01