
I'm trying to build a decision tree. My target is the array [0, 1] (binary 'No' or 'Yes'), and my training input is a three-dimensional array: the first element holds all the 'No' examples (10 of them, with 35 features each) and the second holds the 'Yes' examples in the same layout. But I keep getting an error.

    file1 = open("file1.txt") # examples of 'No' class
    file2 = open("file2.txt") # examples of 'Yes' class
    x = vectorizer.fit_transform(file1)
    y = vectorizer.fit_transform(file2)    

    x_array = x.toarray()    
    y_array = y.toarray()    


    x_train, x_test, y_train, y_test = train_test_split(x_array, y_array, 
    test_size=0.2)    
    target = [0, 1] # 0 encoded as 'No' and 1 as 'Yes'
    train = [x_train, y_train]

    decisiontree = DecisionTreeClassifier(random_state=0, max_depth=5)
    decisiontree = decisiontree.fit(train, target)    

Thanks for any help.

Edit: I am loading the data from a txt file, and it is text data. I tried printing part of the array, and here it is:

[[0 0 0 ... 0 0 0]    
 [0 0 0 ... 0 0 0]     
 [0 0 0 ... 0 0 0]     
 [0 0 0 ... 0 0 0]]    
Pure Evil
  • Please post more of your code, showing your train and target arrays, along with a sample of your dataset. – Imanpal Singh Feb 03 '20 at 03:06
  • This might not be the only issue but `decisiontree = decisiontree.fit(train, target)` should be `decisiontree = decisiontree.fit(x_train, y_train)`. – Max Power Feb 03 '20 at 04:21
  • @Max Power x_train and y_train are examples of class 'No' and 'Yes'. Shouldn't the target be [0, 1], with 0 encoded as 'No' and 1 as 'Yes' ? – Pure Evil Feb 03 '20 at 04:31
  • If the `target` is `[0,1]` and you pass that as the second param to the `fit` method, sklearn interprets that as "there are two records/rows we're training on, the first has a ground-truth value of 0 and the second has a ground-truth value of 1". But I'm pretty sure that's not what you want, since you seem to have more than two rows of data. Also, the first param you pass to `fit` should be a single 2-d array of shape (num_records, num_features), not a list of two different arrays. – Max Power Feb 03 '20 at 14:27
  • As a more general piece of advice though, to get best results on stack overflow you should post a minimal, reproducible example that includes code that creates some sample data to run the code you're trying to debug. No one here has your `file1.txt` so no one can actually run your code to iterate on it until it actually works. For an example of how to write a good complete minimal example with sample data, see here: https://stackoverflow.com/a/43298736/1870832 – Max Power Feb 03 '20 at 14:30

1 Answer


I think the problem comes from a misunderstanding of what decisiontree.fit expects.

decisiontree.fit(X, Y) expects X to be the data points and Y to be their labels, one label per data point. That is, if X has the shape N x 35 (N data points with 35 features each, as in your case), then Y should have the shape N.
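To make the shape requirement concrete, here is a minimal sketch on made-up data. The random integer arrays are only stand-ins for the vectorized text (they are not the data from the question), and the names `no_rows`, `yes_rows` and `wrong_call_failed` are invented for this illustration. It first repeats the call pattern from the question, which scikit-learn rejects, and then builds the single 2-D array and per-row labels that `fit` accepts.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Made-up stand-ins for the vectorized text:
# 10 'No' rows and 10 'Yes' rows with 35 features each.
no_rows = rng.integers(0, 3, size=(10, 35))
yes_rows = rng.integers(0, 3, size=(10, 35))

# The pattern from the question: a list of two arrays (a 3-D input)
# paired with only two labels.
try:
    DecisionTreeClassifier(random_state=0).fit([no_rows, yes_rows], [0, 1])
    wrong_call_failed = False
except ValueError as err:
    wrong_call_failed = True
    print("fit rejected the input:", err)

# What fit expects: one 2-D array of shape (N, 35) and one label per row.
X = np.concatenate([no_rows, yes_rows], axis=0)
y = np.array([0] * len(no_rows) + [1] * len(yes_rows))
print(X.shape, y.shape)  # (20, 35) (20,)

clf = DecisionTreeClassifier(random_state=0, max_depth=5).fit(X, y)
```

The point is that the class is carried by the label array, not by how the inputs are grouped: every row of X gets its own 0 or 1.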

You should combine x_array and y_array into a single data set, build a label array with one entry per row, split both together, and call fit with the training rows and their corresponding labels.

Consider the following:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # `vectorizer` is the object from the question (its definition is not shown there)
    with open("file1.txt") as file1:  # examples of the 'No' class
        no_lines = file1.readlines()
    with open("file2.txt") as file2:  # examples of the 'Yes' class
        yes_lines = file2.readlines()

    # fit once on all the text so both classes share the same feature columns;
    # calling fit_transform on each file separately can build two different vocabularies
    vectorizer.fit(no_lines + yes_lines)
    x_array = vectorizer.transform(no_lines).toarray()
    y_array = vectorizer.transform(yes_lines).toarray()

    # ------------------------------------------------------------
    # combine the 'No' and 'Yes' examples into one 2-D array
    data = np.concatenate([x_array, y_array], axis=0)
    # create the corresponding labels: 0 for each 'No' row, 1 for each 'Yes' row
    labels = np.concatenate([np.zeros(x_array.shape[0]),
                             np.ones(y_array.shape[0])], axis=0)

    # split into train and test sets
    train_data, test_data, train_labels, test_labels = train_test_split(
        data, labels, test_size=0.2)

    decisiontree = DecisionTreeClassifier(random_state=0, max_depth=5)
    decisiontree = decisiontree.fit(train_data, train_labels)

    # ------------------------------------------------------------
    # this is how you can test model performance with the test set
    correct_predictions = np.count_nonzero(
        decisiontree.predict(test_data) == test_labels
    )

    print("Correct predictions in test set: {}/{}".format(correct_predictions,
                                                          test_labels.shape[0]))
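Since nobody else has file1.txt or file2.txt, here is a self-contained sketch of the same steps that can be run as-is. Everything specific in it is an assumption for illustration: the sample sentences are invented, and `CountVectorizer` is only a guess at the kind of vectorizer used, since the question does not show how `vectorizer` was created. `stratify=labels` and the fixed `random_state` are additions to keep the tiny split balanced and repeatable.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Invented sample lines standing in for file1.txt ('No') and file2.txt ('Yes').
no_lines = [
    "the service was slow and rude",
    "would not recommend this place",
    "terrible food and bad coffee",
    "the room was dirty and loud",
    "never coming back here again",
]
yes_lines = [
    "great food and friendly staff",
    "would happily recommend this place",
    "lovely room and quiet street",
    "the coffee was excellent",
    "coming back here again soon",
]

# Assumption: a plain bag-of-words vectorizer, fitted once on all the text
# so that both classes share the same feature columns.
vectorizer = CountVectorizer()
vectorizer.fit(no_lines + yes_lines)
x_array = vectorizer.transform(no_lines).toarray()
y_array = vectorizer.transform(yes_lines).toarray()

# One 2-D data array and one label per row (0 = 'No', 1 = 'Yes').
data = np.concatenate([x_array, y_array], axis=0)
labels = np.concatenate([np.zeros(len(no_lines), dtype=int),
                         np.ones(len(yes_lines), dtype=int)])

train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.2, random_state=0, stratify=labels)

decisiontree = DecisionTreeClassifier(random_state=0, max_depth=5)
decisiontree.fit(train_data, train_labels)

# Fraction of test rows predicted correctly.
accuracy = decisiontree.score(test_data, test_labels)
print("Test accuracy:", accuracy)
```

With only ten short sentences the accuracy figure means very little; the sketch is only meant to show the shapes and the order of the steps.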
Tin Lai