
I have data of different types and I want to predict the dependent variable Y from the variables A, B, and C shown below.

    Y       A       B     C
0   11.3914 2.75    0     [0, 0, 10, 17, 35, 26, 0]
1   14.0348 2.50    0     [0, 0, 39, 35, 30, 5, 0]  
2   14.8416 2.75    1     [0, 0, 12, 5, 5, 2, 1]
3   13.7829 2.25    0     [0, 0, 2, 18, 14, 8, 0]   
...

The following attempt fails on the fit line with ValueError: setting an array element with a sequence.

X = df[['A', 'B', 'C']]
y = df['Y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
tree_reg = DecisionTreeRegressor()  
tree_reg.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

I assumed this was because of the array data in C, but when I try to predict with only variables A and B, i.e. X = df[['A', 'B']], I get a different error, this time on the final predict line: ValueError: Number of features of the model must match the input. Model n_features is 7 and input n_features is 2

What am I doing wrong? How can I include each of these features in X?

rafvasq

1 Answer


I think the error you get when using only features A and B comes from the last line:

y_pred = regressor.predict(X_test)

It seems that you are using the wrong model to predict. You fit a model named tree_reg, but you call predict on a different model, regressor (perhaps left over from some earlier data). That regressor model expects 7 features, but you are providing only 2. Calling tree_reg.predict(X_test) instead should resolve this error.
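As a minimal sketch of that fix, the snippet below fits and predicts with the same estimator. It assumes a DataFrame built from the four rows shown in the question (the real data is larger), and the random_state values are added here only to make the run repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Only the four rows shown in the question; the real df has more.
df = pd.DataFrame({
    "Y": [11.3914, 14.0348, 14.8416, 13.7829],
    "A": [2.75, 2.50, 2.75, 2.25],
    "B": [0, 0, 1, 0],
})

X = df[["A", "B"]]
y = df["Y"]

# random_state is an assumption added for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

tree_reg = DecisionTreeRegressor(random_state=0)
tree_reg.fit(X_train, y_train)

# Predict with the estimator that was just fitted, not an older one.
y_pred = tree_reg.predict(X_test)
```

Because tree_reg was fitted on two columns, it also expects two columns at predict time, so the feature counts match.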

Error when using all three features A, B and C

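scikit-learn expects every feature to be a single number, so a column whose cells are lists cannot be passed to fit directly. When a DataFrame column holds lists, you can call the tolist() method on that column and pass the result to the DataFrame constructor, which turns each list position into its own column.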
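A rough sketch of that approach follows. It assumes every list in C has the same length (7 in the rows shown) and uses only the four rows from the question; the C_0 ... C_6 column names are made up for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Only the four rows shown in the question; the real df has more.
df = pd.DataFrame({
    "Y": [11.3914, 14.0348, 14.8416, 13.7829],
    "A": [2.75, 2.50, 2.75, 2.25],
    "B": [0, 0, 1, 0],
    "C": [
        [0, 0, 10, 17, 35, 26, 0],
        [0, 0, 39, 35, 30, 5, 0],
        [0, 0, 12, 5, 5, 2, 1],
        [0, 0, 2, 18, 14, 8, 0],
    ],
})

# Expand the column of lists: one new column per list position.
# The "C_" prefix is an illustrative naming choice.
c_cols = pd.DataFrame(df["C"].tolist(), index=df.index).add_prefix("C_")

# Combine the scalar features with the expanded columns.
X = pd.concat([df[["A", "B"]], c_cols], axis=1)
y = df["Y"]

tree_reg = DecisionTreeRegressor(random_state=0)
tree_reg.fit(X, y)  # X now has 2 + 7 = 9 numeric columns
```

The same expansion has to be applied to any data passed to predict later, so that the model sees the same nine columns it was fitted on.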

Split column of lists into multiple columns

skillsmuggler