Scikit-learn's DecisionTreeClassifier's fit method gives ValueError: Couldn't broadcast input array from shape (10,35) into shape (10)

Question

I'm trying to build a decision tree. My target is the array [0, 1] (binary 'No' or 'Yes'), and my input training_set is a three-dimensional array: its first element holds all the 'No' examples (10 of them, with 35 features each), and its second element holds the 'Yes' examples in the same layout. But I keep getting this error.

    file1 = open("file1.txt") # examples of 'No' class
    file2 = open("file2.txt") # examples of 'Yes' class
    x = vectorizer.fit_transform(file1)
    y = vectorizer.fit_transform(file2)    

    x_array = x.toarray()    
    y_array = y.toarray()    


    x_train, x_test, y_train, y_test = train_test_split(x_array, y_array, 
    test_size=0.2)    
    target = [0, 1] # 0 encoded as 'No' and 1 as 'Yes'
    train = [x_train, y_train]

    decisiontree = DecisionTreeClassifier(random_state=0, max_depth=5)
    decisiontree = decisiontree.fit(train, target)

Thanks for help.

Edit: I am loading the data from a txt file, and it is text data. I tried printing part of the array, and here it is:

[[0 0 0 ... 0 0 0]    
 [0 0 0 ... 0 0 0]     
 [0 0 0 ... 0 0 0]     
 [0 0 0 ... 0 0 0]]

Please post more of your code, showing your train and target arrays, along with a sample of your dataset. — Imanpal Singh, Feb 03 '20 at 03:06
This might not be the only issue but `decisiontree = decisiontree.fit(train, target)` should be `decisiontree = decisiontree.fit(x_train, y_train)`. — Max Power, Feb 03 '20 at 04:21
@Max Power x_train and y_train are examples of class 'No' and 'Yes'. Shouldn't the target be [0, 1], with 0 encoded as 'No' and 1 as 'Yes' ? — Pure Evil, Feb 03 '20 at 04:31
If the `target` is `[0,1]` and you pass that as the second param to the `fit` method, sklearn interprets that as "there are two records/rows we're training on, the first has a ground-truth value of 0 and the second has a ground-truth value of 1". But I'm pretty sure that's not what you want, since you seem to have more than two rows of data. Also, the first param you pass to `fit` should be a single 2-d array of shape (num_records, num_features), not a list of two different arrays. — Max Power, Feb 03 '20 at 14:27
As a more general piece of advice, to get the best results on Stack Overflow you should post a minimal, reproducible example that includes code creating some sample data for the code you're trying to debug. No one here has your `file1.txt`, so no one can run your code and iterate on it until it works. For an example of how to write a good, complete minimal example with sample data, see here: https://stackoverflow.com/a/43298736/1870832 — Max Power, Feb 03 '20 at 14:30

score 0 · Answer 1 · answered Feb 03 '20 at 05:15

I think the error comes from a misunderstanding of what decisiontree.fit expects as its arguments.

decisiontree.fit(X, Y) expects X to hold the data points and Y to hold the labels, one label per data point. That is, if X has the shape N x 35 (35 features, as in your data), then Y should have the shape N, where N is the number of data points.
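To make the shape requirement concrete, here is a minimal sketch. It uses random 0/1 arrays as hypothetical stand-ins for the vectorized text (the names `no_examples` and `yes_examples` are not from the question), and it assumes 10 rows and 35 features per class as described there. The first call follows the (N, 35) / (N,) layout; the second mimics the call from the question and should be rejected, although the exact error message depends on the scikit-learn version.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the question's arrays: 10 rows x 35 features per class.
no_examples = rng.integers(0, 2, size=(10, 35))
yes_examples = rng.integers(0, 2, size=(10, 35))

# What fit expects: one 2-D X of shape (N, 35) and one 1-D y of shape (N,).
X = np.vstack([no_examples, yes_examples])   # shape (20, 35)
y = np.array([0] * 10 + [1] * 10)            # shape (20,), one label per row

clf = DecisionTreeClassifier(random_state=0, max_depth=5).fit(X, y)

# What the question passes: a list of two 2-D arrays (effectively 3-D) and two labels.
try:
    DecisionTreeClassifier().fit([no_examples, yes_examples], [0, 1])
    failed = False
except ValueError as err:
    failed = True
    print("fit rejected the input:", err)
```

The point of the comparison is that each row of X is one example, so the label array needs one entry per row rather than one entry per class.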

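You should combine x_array and y_array into a single data set, build a matching array of labels (0 for each 'No' row, 1 for each 'Yes' row), split the data and labels together, and then fit on the training portion with its corresponding labels.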

Consider the following:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # assumption: the question does not show its vectorizer; CountVectorizer is a stand-in
    vectorizer = CountVectorizer()

    with open("file1.txt") as file1, open("file2.txt") as file2:
        no_docs = file1.readlines()   # examples of the 'No' class
        yes_docs = file2.readlines()  # examples of the 'Yes' class

    # fit the vocabulary once on all documents so both arrays share the same columns
    vectorizer.fit(no_docs + yes_docs)
    x_array = vectorizer.transform(no_docs).toarray()
    y_array = vectorizer.transform(yes_docs).toarray()

    # ------------------------------------------------------------
    # combine the negative ('No') and positive ('Yes') examples
    data = np.concatenate([x_array, y_array], axis=0)
    # create the corresponding labels: 0 for each 'No' row, 1 for each 'Yes' row
    labels = np.concatenate([np.zeros(x_array.shape[0]),
                             np.ones(y_array.shape[0])], axis=0)

    # split into train and test sets
    train_data, test_data, train_labels, test_labels = train_test_split(
        data, labels, test_size=0.2)

    decisiontree = DecisionTreeClassifier(random_state=0, max_depth=5)
    decisiontree = decisiontree.fit(train_data, train_labels)

    # ------------------------------------------------------------
    # this is how you can test model performance with the test set
    correct_predictions = np.count_nonzero(
        decisiontree.predict(test_data) == test_labels
    )

    print("Correct predictions in test set: {}/{}".format(
        correct_predictions, test_labels.shape[0]))
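Since nobody else has the original text files, here is a self-contained sketch of the same workflow that runs without them. The toy sentences are made up for illustration, and `CountVectorizer` is an assumption, because the question does not say which vectorizer it uses. It fits the vocabulary once on all documents so every row has the same columns, and it uses the classifier's `score` method as a shortcut for test accuracy.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy documents standing in for the lines of file1.txt and file2.txt.
no_docs = ["bad slow service", "never again", "poor and slow",
           "bad food", "not good at all"]
yes_docs = ["great fast service", "would come again", "good and fast",
            "great food", "really good"]

docs = no_docs + yes_docs
# One label per document: 0 for 'No', 1 for 'Yes'.
labels = np.array([0] * len(no_docs) + [1] * len(yes_docs))

# Fit the vocabulary once on all documents (CountVectorizer is an assumed choice).
vectorizer = CountVectorizer()
data = vectorizer.fit_transform(docs).toarray()

# Split rows and labels together; stratify keeps both classes in each split.
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.2, random_state=0, stratify=labels)

clf = DecisionTreeClassifier(random_state=0, max_depth=5)
clf.fit(train_data, train_labels)

# score() returns the fraction of correctly classified test rows.
accuracy = clf.score(test_data, test_labels)
print("Test accuracy:", accuracy)
```

With only ten toy documents the accuracy figure itself means little; the sketch is only meant to show that the shapes line up from vectorizing through to evaluation.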
