How to use two different datasets for training and testing a Decision Tree

Question

I am trying to train a decision tree classifier on the dataset data.csv, which contains 1500 data points and 107 columns, with column 107 as the target. I then want to test the classifier on the dataset data_test.csv, which contains 917 data points and 107 columns, again with column 107 as the target. This is the code I have written:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Load both datasets and replace missing values with 0
test_data = pd.read_csv("data_test.csv")
Data = pd.read_csv("data.csv")
Data = Data.fillna(0)
test_data = test_data.fillna(0)

# Inspect the training data
Data.head(10)
Data.shape
Data.describe()
Data.info()

# Features are the first 106 columns; the target is column 107
X = Data.iloc[:, 0:106]
y = Data["Target (Col 107)"]

# Hold out 30% of data.csv for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Unrestricted tree
dTree = DecisionTreeClassifier(criterion='gini', random_state=1)
dTree.fit(X_train, y_train)

print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))

# Regularized tree, limited to depth 3
dTreeR = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=1)
dTreeR.fit(X_train, y_train)
print(dTreeR.score(X_train, y_train))
print(dTreeR.score(X_test, y_test))

#print(dTreeR.score(X_test , y_test))
# Predict on the separate test file
y_predict = dTreeR.predict(test_data)

# Compare the predictions against y_test
cm = confusion_matrix(y_test, y_predict, labels=[0, 1])
print(cm)
print(classification_report(y_test, y_predict))

After running this code, I get the following error when executing the y_predict line:

ValueError: Found input variables with inconsistent numbers of samples: [450, 917]

Please let me know where I am going wrong.
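
The numbers in the error message point to a likely cause: y_test comes from the 30% split of data.csv (1500 × 0.30 = 450 rows), while y_predict is produced from data_test.csv (917 rows), so the metrics are asked to compare two vectors of different lengths. The following is a minimal sketch of one way to train on one dataset and evaluate on the other. It assumes both files contain the "Target (Col 107)" column and the same feature columns. The helper names train_and_evaluate and make_frame are hypothetical, and make_frame only generates synthetic stand-in data so the example runs without the original CSV files.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

TARGET = "Target (Col 107)"  # target column name taken from the question


def train_and_evaluate(train_df, test_df, target=TARGET):
    """Fit on train_df and evaluate on test_df (hypothetical helper).

    Assumes both frames contain the target column and the same features.
    """
    X_train = train_df.drop(columns=[target])
    y_train = train_df[target]

    # Select the test features by the training column names so that the
    # target column is excluded and the column order matches.
    X_test = test_df[X_train.columns]
    y_test = test_df[target]

    model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # y_test and y_pred both have len(test_df) rows, so the metrics line up.
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
    report = classification_report(y_test, y_pred)
    return model, y_pred, cm, report


def make_frame(n_rows, seed):
    """Build synthetic stand-in data: 106 features plus a binary target."""
    rng = np.random.default_rng(seed)
    features = rng.normal(size=(n_rows, 106))
    df = pd.DataFrame(features, columns=[f"f{i}" for i in range(106)])
    df[TARGET] = (features[:, 0] > 0).astype(int)
    return df


# Stand-ins for pd.read_csv("data.csv") and pd.read_csv("data_test.csv")
train_df = make_frame(1500, seed=0)
test_df = make_frame(917, seed=1)

model, y_pred, cm, report = train_and_evaluate(train_df, test_df)
print(cm)
print(report)
```

With the real files, train_df and test_df would be replaced by pd.read_csv("data.csv").fillna(0) and pd.read_csv("data_test.csv").fillna(0). If data_test.csv turns out not to contain the target column, predictions can still be made, but there are no true labels to compute a confusion matrix against.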

I would also like to know how to export the decision tree's predictions to a CSV file.
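
For the export, one common approach is to put the predictions into a pandas DataFrame and call DataFrame.to_csv. The sketch below assumes the predictions are available as a list or array. The helper name save_predictions, the column names, and the output file name are all hypothetical, and the example values are placeholders rather than real model output.

```python
import os
import tempfile

import pandas as pd


def save_predictions(y_pred, path, y_true=None):
    """Write predictions, and optionally the true labels, to a CSV file.

    Hypothetical helper; the column names are illustrative only.
    """
    out = pd.DataFrame({"predicted": list(y_pred)})
    if y_true is not None:
        out["actual"] = list(y_true)
    # The row index is written as the first column, labeled "row".
    out.to_csv(path, index_label="row")
    return out


# Placeholder values; in the question these would come from
# dTreeR.predict(...) and the matching true labels.
example_pred = [0, 1, 1, 0]
example_true = [0, 1, 0, 0]

out_path = os.path.join(tempfile.gettempdir(), "predictions.csv")
saved = save_predictions(example_pred, out_path, y_true=example_true)
print(pd.read_csv(out_path))
```

Passing y_true is optional, so the same helper works when the test file has no labels.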

Thanks!

check sizes of your variables. test_data does not have 107 columns — lejlot, Oct 22 '22 at 19:53
Test_data and data both have 107 columns, they both have different number of rows — Revanth Kumar Perla, Oct 23 '22 at 04:53
You can see how to output the tree as a .csv file here: https://stackoverflow.com/questions/76712060/how-to-convert-a-fitted-scikit-learn-decision-tree-model-to-a-tabular-format — Zain, Aug 02 '23 at 14:58
