I have data of different types, and I want to predict the dependent variable Y from variables A, B, and C, shown below.
   Y        A     B  C
0  11.3914  2.75  0  [0, 0, 10, 17, 35, 26, 0]
1  14.0348  2.50  0  [0, 0, 39, 35, 30, 5, 0]
2  14.8416  2.75  1  [0, 0, 12, 5, 5, 2, 1]
3  13.7829  2.25  0  [0, 0, 2, 18, 14, 8, 0]
...
The following attempt raises "ValueError: setting an array element with a sequence." on the fit line:
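from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X = df[['A', 'B', 'C']]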
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
I assumed this was because of the array data in C, but when I try to predict with only variables A and B, i.e. X = df[['A', 'B']], I get another error, this time on the final predict line: "ValueError: Number of features of the model must match the input. Model n_features is 7 and input n_features is 2".
What am I doing wrong? How can I include each of these features in X?
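For reference, here is a minimal sketch of one possible way to get every feature into X. It assumes each entry of C is a Python list of exactly seven numbers and expands it into seven numeric columns; the names C_0 to C_6 are made up for illustration, and the toy DataFrame only copies the four rows shown above. Note also that the snippet above fits tree_reg but calls predict on regressor. If regressor is an older model that was fitted on 7 features, that would likely explain the second error, so the sketch predicts with the same estimator it fitted.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Toy frame copying the four rows shown in the question (not the full data).
df = pd.DataFrame({
    "Y": [11.3914, 14.0348, 14.8416, 13.7829],
    "A": [2.75, 2.50, 2.75, 2.25],
    "B": [0, 0, 1, 0],
    "C": [
        [0, 0, 10, 17, 35, 26, 0],
        [0, 0, 39, 35, 30, 5, 0],
        [0, 0, 12, 5, 5, 2, 1],
        [0, 0, 2, 18, 14, 8, 0],
    ],
})

# Assumption: every list in C has the same length (7 here).
# Expand the list column into one numeric column per position: C_0 .. C_6.
c_cols = pd.DataFrame(df["C"].tolist(), index=df.index).add_prefix("C_")

# Feature matrix: the scalar columns A and B plus the expanded C columns.
X = pd.concat([df[["A", "B"]], c_cols], axis=1)
y = df["Y"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

tree_reg = DecisionTreeRegressor(random_state=0)
tree_reg.fit(X_train, y_train)

# Predict with the same estimator that was just fitted.
y_pred = tree_reg.predict(X_test)
```

With this layout the tree sees nine plain numeric features, so scikit-learn no longer has to convert a column of lists into an array. Whether treating each position of C as its own feature makes sense depends on what C represents.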