I have tried to train a decision tree classifier on the dataset data.csv, which contains 1500 data points and 107 columns, with column 107 as the target. I then want to test the classifier on the dataset data_test.csv, which contains 917 data points and also 107 columns, with column 107 as the target. This is the code I have written:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
test_data = pd.read_csv("data_test.csv")
Data = pd.read_csv("data.csv")
Data = Data.fillna(0)
test_data = test_data.fillna(0)
Data.head(10)
Data.shape
Data.describe()
Data.info()
X = Data.iloc[:, 0:106]
y = Data["Target (Col 107)"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)
print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))
dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=1)
dTreeR.fit(X_train, y_train)
print(dTreeR.score(X_train, y_train))
print(dTreeR.score(X_test, y_test))
#print(dTreeR.score(X_test , y_test))
y_predict = dTreeR.predict(test_data)
cm = confusion_matrix(y_test, y_predict, labels=[0, 1])
print(cm)
print(classification_report(y_test, y_predict))
After running this code, I get the following error when executing the y_predict line:
ValueError: Found input variables with inconsistent numbers of samples: [450, 917]
Please let me know where I am going wrong.
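For reference, the two numbers in the message match the sizes involved: 450 is 30% of the 1500 rows in data.csv (the length of y_test), and 917 is the number of rows in data_test.csv (the length of y_predict). Messages of this form are raised when a metric such as confusion_matrix is given label arrays of different lengths. Below is a minimal sketch of an evaluation where the lengths agree, assuming the true labels for the external test set come from data_test.csv itself. Synthetic data stands in for the two CSV files, and the helper make_frame and the names X_ext and y_ext are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

TARGET = "Target (Col 107)"

def make_frame(n_rows, seed):
    """Hypothetical stand-in for pd.read_csv: 106 feature columns plus the target."""
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame(rng.normal(size=(n_rows, 106)),
                         columns=[f"f{i}" for i in range(106)])
    # The label depends on the first feature so the tree has something to learn
    frame[TARGET] = (frame["f0"] > 0).astype(int)
    return frame

Data = make_frame(1500, seed=0)       # stands in for data.csv
test_data = make_frame(917, seed=1)   # stands in for data_test.csv

X, y = Data.drop(columns=TARGET), Data[TARGET]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Split the external test set into features and labels the same way
X_ext, y_ext = test_data.drop(columns=TARGET), test_data[TARGET]

dTreeR = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=1)
dTreeR.fit(X_train, y_train)

# y_test (450 rows) pairs with predictions on X_test;
# y_ext (917 rows) pairs with predictions on X_ext
y_predict = dTreeR.predict(X_ext)
cm = confusion_matrix(y_ext, y_predict, labels=[0, 1])
print(cm)
```

The point of the sketch is only that the true labels and the predictions passed to a metric must describe the same rows. Whether the real data_test.csv should be evaluated this way depends on its actual contents.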
I would also like to know how to export the decision tree's predictions to a CSV file.
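One common approach is to put the predictions into a pandas DataFrame and call to_csv on it. This is a minimal sketch on toy data, not a definitive solution. The file name predictions.csv, the column name predicted_target, and the variable names model and X_new are placeholders chosen for illustration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the real features and target
X_train = pd.DataFrame({"f0": [0, 1, 2, 3], "f1": [1, 0, 1, 0]})
y_train = pd.Series([0, 0, 1, 1])
X_new = pd.DataFrame({"f0": [0.5, 2.5], "f1": [1, 0]})

model = DecisionTreeClassifier(max_depth=3, random_state=1)
model.fit(X_train, y_train)

# Keep the input index so each prediction can be traced back to its row
results = pd.DataFrame({"predicted_target": model.predict(X_new)},
                       index=X_new.index)

# Write one row per prediction; the index is saved under the header "row"
results.to_csv("predictions.csv", index_label="row")
```

Passing index=False to to_csv instead would write only the prediction column, without the row index.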
Thanks!