I have been working on a basic DecisionTree classifier, and I need my model to ask a question at each node. My disease predictor should guess the disease based on the symptoms reported by the user, so I want to ask the user at each stage whether they have the specific symptom (the one the node splits on) and use the answers to predict the output.

In detail, here is my current code snippet:

import pandas as pd
import numpy as np
from sklearn import tree
..
..
#import data from db and store in variables
..
clf = tree.DecisionTreeClassifier(criterion='entropy', splitter='best')
clf = clf.fit(relations,diseaseCodes)

print(clf.predict([relations[10]]))

Here I have to supply the complete list of all values in a single go. I want to ask my user a question at each step, such as which symptom they have now, and classify the disease on the basis of the answers.

Note: I know my decision tree is overfitted.

Rishabh Ryber
  • That's not how sklearn models work. Better architecture is to ask all of the symptom questions at once, then pass that array into `predict`. If you want the prediction to change with each question, then that would be a custom object or set of functions, and too broad for stack overflow without a [mcve] for what you've actually tried – G. Anderson Jan 15 '20 at 19:45
    @G.Anderson I understand your concern buddy but the problem can be solved in multiple ways and I can't ask user about all 400 symptoms I have at a time – Rishabh Ryber Jan 15 '20 at 19:49
    If "the problem can be solved in multiple ways" then please show the code for what you've tried so far and how your results differ from the expected results, so that we have the minimal example requested to be able to help you in a meaningful way – G. Anderson Jan 15 '20 at 20:37

1 Answer

For this, you can manually traverse the fitted tree, reading the low-level arrays of its tree_ attribute instead of calling predict.

First, let's get a fitted tree, using the "iris" dataset:

import numpy as np # linear algebra
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3).fit(data['data'],data['target'])

Let's visualize this tree, primarily to debug our final program:

plt.figure(figsize=(10,8))
plot_tree(clf,feature_names=data['feature_names'],class_names=data['target_names'],filled=True);

Which outputs, in my case: [image: plot of the fitted decision tree]
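
A text rendering of the tree can be handy alongside the plot when checking the interactive traversal. A minimal sketch, assuming scikit-learn 0.21 or later (where `sklearn.tree.export_text` is available); `random_state=0` is my addition so the split choices are repeatable:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data['data'], data['target'])

# Render the fitted tree as indented text rules, one line per split or leaf.
tree_text = export_text(clf, feature_names=list(data['feature_names']))
print(tree_text)
```

Each indented line shows either a split condition or the class predicted at a leaf, so it can be compared directly with the questions asked during a run.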

Now the main part. From the scikit-learn documentation (the page linked in the original answer), we know that:

The binary tree "tree_" is represented as a number of parallel arrays. The i-th element of each array holds information about the node i.
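
To make the parallel-array layout concrete, here is a small sketch that lists every node of a fitted tree. It assumes the same iris setup as above; `random_state=0` and the variable name `t` are my additions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data['data'], data['target'])
t = clf.tree_

# Every array has one entry per node; index i describes node i.
for i in range(t.node_count):
    left, right = t.children_left[i], t.children_right[i]
    if left == right:  # both children are -1 at a leaf
        print(f"node {i}: leaf, class distribution={t.value[i].ravel()}")
    else:
        name = data['feature_names'][t.feature[i]]
        print(f"node {i}: split on {name} <= {t.threshold[i]:.2f}, children {left}/{right}")
```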

The arrays we need are feature, threshold, value, and the two children_* arrays (children_left and children_right). Starting from the root (i=0), at each node we visit we read its feature and threshold, ask the user for the value of that feature, and move left if the value is less than or equal to the threshold, otherwise right. A leaf is a node whose left and right children are equal (both are -1). When we reach one, we take the most frequent class in that leaf, and that ends the loop.

tree = clf.tree_
node = 0      # Index of the root node
# A leaf has children_left == children_right (both -1), so loop until we hit one.
# Checking before asking also handles a tree whose root is already a leaf.
while tree.children_left[node] != tree.children_right[node]:
    feat, thres = tree.feature[node], tree.threshold[node]
    print(feat, thres)  # debug output: feature index and threshold of this split
    v = float(input(f"The value of {data['feature_names'][feat]}: "))
    if v <= thres:
        node = tree.children_left[node]
    else:
        node = tree.children_right[node]

label = np.argmax(tree.value[node])  # most frequent class in the leaf
print("We've reached a leaf")
print(f"Predicted Label is: {data['target_names'][label]}")

An example of such a run for the above tree is:

3 0.800000011920929
The value of petal width (cm): 1
3 1.75
The value of petal width (cm): 1.5
2 4.950000047683716
The value of petal length (cm): 5.96
We've reached a leaf
Predicted Label is: virginica
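
For the symptom use case in the question, the same traversal can be wrapped in a function that takes the question-asking step as a callback. This is only a sketch under stated assumptions: the symptom names, the toy data, and the helpers `predict_interactively` and `ask_user` are hypothetical, and features are assumed to be encoded as 1 (present) or 0 (absent).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def predict_interactively(clf, feature_names, ask):
    """Walk the fitted tree, calling ask(feature_name) -> number at each split."""
    t = clf.tree_
    node = 0
    asked = []
    while t.children_left[node] != t.children_right[node]:  # not a leaf yet
        name = feature_names[t.feature[node]]
        asked.append(name)
        if ask(name) <= t.threshold[node]:
            node = t.children_left[node]
        else:
            node = t.children_right[node]
    # Map the most frequent class index in the leaf back to its label.
    return clf.classes_[np.argmax(t.value[node])], asked

def ask_user(symptom):
    """Hypothetical console prompt: 1.0 for yes, 0.0 for anything else."""
    answer = input(f"Do you have {symptom}? (y/n): ")
    return 1.0 if answer.strip().lower().startswith("y") else 0.0

# Toy data (made up for illustration): one column per symptom.
symptoms = ["fever", "cough", "rash"]
X = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]])
y = np.array(["flu", "malaria", "measles", "cold"])
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Scripted answers stand in for a real user here.
answers = {"fever": 1, "cough": 0, "rash": 0}
disease, asked = predict_interactively(clf, symptoms, answers.get)
print(disease, asked)
```

Passing `ask_user` instead of `answers.get` turns this into the interactive version. Only the symptoms on the path from the root to a leaf are asked about, so the user never has to answer for every feature.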
Shihab Shahriar Khan