
My data is in a CSV file:

cucumber,green,15,4
tomato,red,7,7
carrots,Orange,13,3
onion,White,8,8
potatoes,Gray,8,6
apple,Red,7,6
apple,Yellow,6,5
coconut,Brown,25,20
orange,Orange,7,7
banana,Yellow,16,4
lemon,Yellow,5,4
watermelon,Green,30,25
cherries,Black,2,2

I want to predict a fruit from its other columns:

    import csv
    from sklearn import tree

    x = []
    y = []
    lst = []

    with open('F5-ML-TEST.csv', 'r') as csvfile:
        data = csv.reader(csvfile)
        for line in data:
            lst.append(line[1])
            lst.append(line[2])
            lst.append(line[3])
            x.append(lst)
            y.append(line[0])
            lst = []

    print('x ----- >', x)
    print('y ----- >', y)

    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(x, y)

    new_data = [["red", 7, 7], ["yellow", 5, 6]]
    answer = clf.predict(new_data)

    print('answer[0]====== >', answer[0])
    print('answer[1]====== >', answer[1])

  • You can only predict numbers with numbers, so sklearn doesn't understand how 'cucumber' predicts something. You need to one-hot encode or something – Nicolas Gervais Apr 25 '20 at 23:57
  • Possibly duplicate, @RezaFiouji have a look here https://stackoverflow.com/q/38108832/4476612 – Rushikesh Gaidhani Apr 26 '20 at 00:04
  • Does this answer your question? [Passing categorical data to Sklearn Decision Tree](https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree) – Rushikesh Gaidhani Apr 26 '20 at 00:05

1 Answer


So what you need to do is encode your string data into numeric features. Here I'm copying your input:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier

    # Read the rows copied from the question (the data has no header row)
    df = pd.read_clipboard(header=None, sep=',')

    # First rows of df:
    #              0       1   2   3
    # 0     cucumber   green  15   4
    # 1       tomato     red   7   7
    # 2      carrots  Orange  13   3
    # 3        onion   White   8   8
    # 4     potatoes    Gray   8   6
    # 5        apple     Red   7   6
    # 6        apple  Yellow   6   5

You will need to encode the "color" column:

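    # scikit-learn >= 1.2 renames this argument to sparse_output=False
    ohe = OneHotEncoder(sparse=False)

    # Column 1 holds the color; the encoder expects a 2-D array
    colors = ohe.fit_transform(df.iloc[:, 1].values.reshape(-1, 1))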

Now it looks like this, every color is a column:

    array([[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
           [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
           [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
           [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.], ...
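To see which column belongs to which color, you can look at the encoder's `categories_` attribute. Here is a minimal sketch on a handful of the colors from the question (the small `colors` array is only for illustration, and I call `.toarray()` so the example does not depend on the `sparse` / `sparse_output` argument):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A few colors from the question, shaped as a single 2-D column
colors = np.array(["green", "red", "Orange", "White", "Gray", "Red"]).reshape(-1, 1)

ohe = OneHotEncoder()
encoded = ohe.fit_transform(colors).toarray()

# categories_[0] lists the colors in the same order as the output columns
column_names = list(ohe.categories_[0])
print(column_names)
print(encoded)
```

Note that "Red" and "red" end up in two different columns, because the encoder compares strings exactly.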

Then you need to concatenate it with the other columns, which are already numeric:

    inputs = np.concatenate([df.iloc[:, 2:].values, colors], axis=1)

Now, you need to turn your targets (the fruit) into numbers:

    oe = OrdinalEncoder()

    targets = oe.fit_transform(df.iloc[:, 0].values.reshape(-1, 1))

Now, they look like this:

    array([[ 5.],
           [10.],
           [ 2.],
           [ 7.],
           [ 9.],
           [ 0.], ...
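The codes are simply the positions of the fruit names in sorted order, and the encoder can map them back. A small sketch with four of the fruits (the short `fruits` array is just for illustration):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Four fruit names from the question, shaped as a single 2-D column
fruits = np.array(["cucumber", "tomato", "carrots", "apple"]).reshape(-1, 1)

oe = OrdinalEncoder()
codes = oe.fit_transform(fruits)        # each name becomes its index in sorted order
restored = oe.inverse_transform(codes)  # and back to the original names

print(codes.ravel())
print(restored.ravel())
```

`inverse_transform` is an alternative to indexing `oe.categories_[0]` by hand when you want to turn predictions back into names.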

Then, you can fit your decision tree:

    clf = DecisionTreeClassifier()
    clf = clf.fit(inputs, targets)
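As a side note, scikit-learn classifiers also accept string class labels directly, so encoding the targets is optional; only the input features have to be numeric. A minimal sketch using just the two numeric columns of a few rows (the names `X` and `y` and the `random_state` are my own choices for the example):

```python
from sklearn.tree import DecisionTreeClassifier

# The two numeric columns of four rows from the question
X = [[15, 4], [7, 7], [13, 3], [25, 20]]
# String labels are passed as-is, no OrdinalEncoder needed
y = ["cucumber", "tomato", "carrots", "coconut"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# predict() returns the label strings themselves
prediction = clf.predict([[25, 20]])
print(prediction[0])
```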

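And now you can even predict new data. The colors must match the training categories exactly, including case: "Yellow" is known to the encoder, but "yellow" would raise an error.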

    new_data = [["red", 7, 7], ["Yellow", 5, 6]]

    # Numeric columns first, then the one-hot color columns (same order as in training)
    new_data = np.concatenate([[i[1:] for i in new_data],
        ohe.transform([[i[0]] for i in new_data])], axis=1)

    answer = clf.predict(new_data)
    oe.categories_[0][answer.astype(int)]
    Out[88]: array(['tomato', 'apple'], dtype=object)
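If you would rather not glue the arrays together by hand, the same idea can be sketched with a `ColumnTransformer` inside a pipeline. This is only one possible layout, under a few assumptions of mine: the column names `color`, `size_a` and `size_b` are made up (the question does not name its columns), the colors are lowercased so that "Red" and "red" count as the same value, and `handle_unknown="ignore"` is set so an unseen color does not raise an error.

```python
import io

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# The CSV rows from the question, inlined so the example is self-contained
CSV_TEXT = """cucumber,green,15,4
tomato,red,7,7
carrots,Orange,13,3
onion,White,8,8
potatoes,Gray,8,6
apple,Red,7,6
apple,Yellow,6,5
coconut,Brown,25,20
orange,Orange,7,7
banana,Yellow,16,4
lemon,Yellow,5,4
watermelon,Green,30,25
cherries,Black,2,2
"""

# Hypothetical column names, chosen only for this example
features = ["color", "size_a", "size_b"]
df = pd.read_csv(io.StringIO(CSV_TEXT), header=None, names=["fruit"] + features)

# Assumption: color case carries no meaning, so normalize it
df["color"] = df["color"].str.lower()

# One-hot encode the color column, pass the numeric columns through unchanged
preprocess = ColumnTransformer(
    [("color", OneHotEncoder(handle_unknown="ignore"), ["color"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, DecisionTreeClassifier(random_state=0))
model.fit(df[features], df["fruit"])

new_data = pd.DataFrame([["red", 7, 7], ["yellow", 5, 6]], columns=features)
answer = model.predict(new_data)
print(answer)
```

The first new row is identical to the tomato row in the training data, so it should come back as tomato. The second row matches no training row exactly, so its prediction depends on how the tree happens to split.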
Nicolas Gervais