I'm currently working on a model to predict the probability of fatality once a person is infected with the coronavirus. I'm using a Dutch dataset with categorical variables: date of infection, outcome (fatality or cured), gender, age group, etc. It was suggested that I use a decision tree, which I've already built. Since I'm new to decision trees, I would like some assistance. I would like the prediction (target variable) expressed as a probability (%), not as a binary output. How can I achieve this? I also want to experiment by entering sample data myself and seeing the outcome. For instance, take someone who is 40, male, etc., and calculate their chance of survival. How can I do that? My code is below:

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
import pandas as pd
import random as rnd

filename = '/Users/sef/Downloads/pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)

model = DecisionTreeClassifier()

model.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

rnd.seed(123458)
X_new = X[rnd.randrange(X.shape[0])]
X_new = X_new.reshape(1,8)
YHat = model.predict_proba(X_new)


df = pd.DataFrame(X_new, columns = names[:-1])
df["predicted"] = YHat
print(df)

Sef

3 Answers


A decision tree can also estimate the probability that an instance belongs to a particular class. Call predict_proba() on your feature data, as below, to get the probability of each class you want to predict. model.predict() returns the class with the highest probability.

model.predict_proba(X_test)
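To make the relationship between the two methods concrete, here is a minimal sketch. It assumes synthetic stand-in data from make_classification rather than the dataset in the question, and the max_depth=3 setting is only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 rows, 4 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

proba = model.predict_proba(X[:5])  # shape (5, 2): one column per class
labels = model.predict(X[:5])       # shape (5,): most probable class per row

print(model.classes_)  # the column order of `proba`
print(proba)
print(labels)
```

Each row of proba sums to 1, and the columns follow the order of model.classes_, so the column to read for a given class is the position of that class in model.classes_.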

Praks
  • Thanks Praks! However, I get the following error: ValueError: Wrong number of items passed 3, placement implies 1 – Sef Aug 13 '20 at 10:51

You can use the predict_proba method of DecisionTreeClassifier to compute class probabilities instead of hard class labels.

To test individual samples that you create by hand, build an array with the same number of columns as your X_test data but only a single row. You can then pass it to model.predict(array) or model.predict_proba(array).
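A hand-entered sample could look roughly like the sketch below. The training data is synthetic and the sample values are made up purely for illustration; only the reshape to a single row with eight columns mirrors the code in the question.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic training data (assumption): 300 rows, 8 numeric features.
X_train = rng.normal(size=(300, 8))
y_train = (X_train[:, 1] + 0.5 * rng.normal(size=300) > 0).astype(int)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# One hand-entered sample; these values are made up for illustration.
sample = np.array([0.2, 1.5, -0.3, 0.0, 0.7, -1.1, 0.4, 0.9])
sample = sample.reshape(1, -1)  # the model expects shape (n_samples, n_features)

print(model.predict(sample))        # predicted class for the single row
print(model.predict_proba(sample))  # one probability per class
```

Using reshape(1, -1) avoids hard-coding the number of features, so the same line keeps working if columns are added later.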

By the way, your tree is currently not useful for retrieving probabilities: it is grown without a depth limit, so its leaves end up pure and predict_proba returns only 0.0 or 1.0. This article explains the problem very well: https://rpmcruz.github.io/machine%20learning/2018/02/09/probabilities-trees.html
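The effect can be checked with a small experiment. This is a sketch on synthetic, deliberately noisy data (an assumption, not the dataset from the question); it compares an unrestricted tree with one limited to max_depth=3.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (assumption), so classes overlap.
X, y = make_classification(n_samples=500, n_features=8, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Unrestricted tree: grows until every leaf holds a single class.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Depth-limited tree: leaves keep a mix of classes.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

full_values = np.unique(full_tree.predict_proba(X_test))
shallow_values = np.unique(shallow_tree.predict_proba(X_test))

print(full_values)     # distinct probabilities from the unrestricted tree
print(shallow_values)  # distinct probabilities from the depth-limited tree
```

The probability reported for a sample is the class fraction among the training rows in its leaf, which is why pure leaves can only produce 0 or 1. Parameters such as min_samples_leaf are another way to keep leaves mixed.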

So you can fix your code by setting the max_depth of your tree:

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
import pandas as pd
import random as rnd

filename = 'pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)

# Limit the depth so that leaves stay mixed and yield fractional probabilities.
# Only the non-default argument is passed: min_impurity_split and presort
# were removed in newer scikit-learn releases.
model = DecisionTreeClassifier(max_depth=1)

model.fit(X_train, Y_train)

rnd.seed(123458)
X_new = X[rnd.randrange(X.shape[0])]
X_new = X_new.reshape(1,8)
YHat = model.predict_proba(X_new)


df = pd.DataFrame(X_new, columns = names[:-1])
df["predicted"] = list(YHat)
print(df)
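As an alternative to list(YHat), each class probability can go into its own column. The sketch below assumes synthetic stand-in data in place of the CSV, and the proba_class_ column prefix is a made-up name. The ValueError reported in the comments appears to come from assigning the 2-D predict_proba result, which has one column per class, to a single DataFrame column.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Synthetic stand-in for the CSV (assumption): 200 rows, 8 features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

X_new = X[:3]
proba = model.predict_proba(X_new)  # shape (3, n_classes)

df = pd.DataFrame(X_new, columns=names)
# One column per class, in the order given by model.classes_.
for i, cls in enumerate(model.classes_):
    df[f"proba_class_{cls}"] = proba[:, i]

print(df)
```

This keeps the probabilities as plain numeric columns, which are easier to sort and filter than a column holding arrays.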
Kim Tang
  • I get the following error when using the predict_proba function, ValueError: Wrong number of items passed 3, placement implies 1 – Sef Aug 13 '20 at 10:52
  • Can you provide a reproducible example for debugging? – Kim Tang Aug 13 '20 at 10:53
  • After clearing the variables in the console and rerunning the code, I get a different error: raise ValueError("Classification metrics can't handle a mix of {0} " ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets. What do you mean by a reproducible example? – Sef Aug 13 '20 at 11:32
  • By a reproducible example, I mean code that I can run as well, so I can see the error and try to debug it. Right now your code reads your own file, so I cannot execute it. Have a look at this: https://stackoverflow.com/help/minimal-reproducible-example – Kim Tang Aug 13 '20 at 11:53
  • I've reproduced the error with another dataset, available at kaggle: https://www.kaggle.com/kumargh/pimaindiansdiabetescsv. The code above is limited to the error. – Sef Aug 13 '20 at 13:23
  • I've used another dataset which produces the same error. While troubleshooting I've found out what causes this error. It is this line of code: df = pd.DataFrame(X_new, columns = names[:-1]). It seems that predicted values with predict_proba can't be inserted in a dataframe. – Sef Aug 13 '20 at 13:44
  • You can fix it with df["predicted"] = list(YHat) for example in your code. – Kim Tang Aug 13 '20 at 13:50
  • Casting it to a list does the trick. Thanks a lot Kim! – Sef Aug 13 '20 at 13:55
  • You're welcome. I added additional information for probabilities with decision trees in my answer (better formatting), because currently your probabilities are only 0. or 1.0, so not very useful. – Kim Tang Aug 13 '20 at 13:58
  • I see it, will dive into it. – Sef Aug 13 '20 at 14:06

Use the predict_proba method:

model.predict_proba(X_test)

For the second part of your question, here is what you will have to do. Create your own custom dataset with exactly the same column names you trained on. Read your data from a CSV and apply the same encoder mappings, if any.

You can also fit and keep a single label encoder object for all columns, which is more efficient:

from sklearn import preprocessing
import pandas as pd

labelencoder = preprocessing.LabelEncoder()
label_encoded_columns = ['Date_statistics_type', 'Agegroup', 'Sex', 'Province', 'Hospital_admission', 'Municipal_health_service', 'Deceased']
# Cast to str so that values from all columns can be encoded together
for col in label_encoded_columns:
    dataframe[col] = dataframe[col].astype(str)
# Fit one encoder on the values of all columns at once
Label_Encoder = labelencoder.fit(dataframe[label_encoded_columns].values.flatten())
Encoded_Array = Label_Encoder.transform(dataframe[label_encoded_columns].values.flatten()).reshape(dataframe[label_encoded_columns].shape)

LE_Dataframe = pd.DataFrame(Encoded_Array, columns=label_encoded_columns, index=dataframe.index)
LE_mapping = dict(zip(Label_Encoder.classes_, Label_Encoder.transform(Label_Encoder.classes_).tolist()))
# LE_mapping maps every original value to its integer code,
# e.g. {'Apple': 0, 'Banana': 1}

For the second part of your question, there are two options. The first is straightforward: pass rows of X_test to the model to get predictions. This assumes X_test is a DataFrame; for a NumPy array, slice with X_test[0:30] instead.

model.predict(X_test.iloc[0:30])  # first 30 rows
model.predict_proba(X_test.iloc[0:30])

The second applies if you want to introduce new data: in that case, you have to label-encode the raw data again, using the encoder that was fitted on the training data.

If the new data contains a value that was not present when the encoder was fitted, the encoder will raise an error about previously unseen labels.

Refer to this link in that case
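One possible way to handle both points, sketched under stated assumptions: this swaps LabelEncoder for scikit-learn's OrdinalEncoder, which can map unseen categories to a fixed code instead of raising. The tiny table is made up for illustration; only the column names are borrowed from the snippet above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training table (assumption), not real data.
train = pd.DataFrame({
    "Agegroup": ["40-49", "80-89", "70-79", "40-49", "80-89", "20-29"],
    "Sex":      ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Deceased": ["No", "Yes", "Yes", "No", "Yes", "No"],
})
features = ["Agegroup", "Sex"]

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train = encoder.fit_transform(train[features])

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, train["Deceased"])

# A new, hand-entered person, encoded with the already fitted encoder.
new_person = pd.DataFrame({"Agegroup": ["40-49"], "Sex": ["Male"]})
X_new = encoder.transform(new_person[features])
proba = model.predict_proba(X_new)
print(dict(zip(model.classes_, proba[0])))

# A category that never appeared during fitting gets the code -1.
unseen = pd.DataFrame({"Agegroup": ["100+"], "Sex": ["Male"]})
unseen_codes = encoder.transform(unseen[features])
print(unseen_codes)
```

Whether -1 is a sensible code for unknown values depends on the model and the data; rejecting such rows outright may be the safer choice for a real analysis.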

  • Thank you, this makes it more clear! Trying to use the predict_proba function now. – Sef Aug 13 '20 at 13:28