train_test_split() is intended to take your dataset and split it into two chunks, the training and testing sets. In your case, you already have the data split into two chunks, in separate csv files. You are then taking the train data and splitting it again into train and val, which is short for validation (essentially the test or verification data).
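If it helps to see that call in isolation, here's a minimal, self-contained sketch using toy data (nothing to do with the Titanic columns):

import pandas as pd
from sklearn.model_selection import train_test_split

# a small frame just to show the mechanics of the split
df = pd.DataFrame({'feature': range(8), 'label': [0, 1] * 4})
x = df[['feature']]
y = df.label

# by default 25% of the rows are held out as the second chunk
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
print(len(train_x), len(val_x))  # 6 2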
You probably want to do the model.fit against your full training data set, and then call model.predict against the test set. There shouldn't be a need for the call to train_test_split().
Edit:
I may be wrong here. Looking at the competition page, I realize that the test set does not include the ground-truth values, so you can't use that data to validate your model's accuracy. In that case, I think splitting the original training dataset into training and validation sets makes sense. Since you're fitting the model only on the train portion, the validation set is still unseen by the model, and you can then use the known values from the validation set to verify your model's predictions.
The test set would just be used to generate 'new' predictions, since you don't have the ground-truth values to check them against.
Edit (in response to comment):
I don't have these data sets and haven't actually run this code, but I'd suggest something like the following. Essentially you want to do the same preparation on your test data as you are doing on the training data, and then feed it into your model the same way the validation set was fed in.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score
def get_dataset(path):
    # read the csv, encode Sex as a numeric code, and drop rows with missing values
    data = pd.read_csv(path)
    data['Sex'] = pd.factorize(data.Sex)[0]
    filtered_titanic_data = data.dropna(axis=0)
    return filtered_titanic_data
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
test_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\test.csv"
train_data = get_dataset(train_path)
test_data = get_dataset(test_path)
columns_of_interest = ['Pclass', 'Sex', 'Age']
x = train_data[columns_of_interest]
y = train_data.Survived
# hold back part of the labelled data so the model can be checked on rows it hasn't seen
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(val_predictions)
# round the regressor's float outputs to 0/1 before scoring them as class labels
print(accuracy_score(val_y, val_predictions.round()))
test_x = test_data[columns_of_interest]
test_predictions = titanic_model.predict(test_x)
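Since DecisionTreeRegressor returns floats rather than 0/1 labels, if you want to eyeball those test predictions you could do something like this (purely illustrative; it assumes the PassengerId column is present, which it is in the Kaggle Titanic csvs):

# pair the rounded predictions with the passenger ids for inspection
results = pd.DataFrame({'PassengerId': test_data.PassengerId,
                        'Survived': test_predictions.round().astype(int)})
print(results.head())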
(Also, note that I removed the Survived column from columns_of_interest. I believe by including that column in your x data, you were giving the model the value that it was attempting to predict, which is likely why you were getting 1.0 for the validation as well. You're giving it the answers to the test.)
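If you want to see that effect on its own, here's a tiny self-contained demo of the leakage (toy data again, not the Titanic set): the 'leak' column is an exact copy of the label, so the tree just reads it back and the validation accuracy comes out as a perfect 1.0.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score

# 'feature' is deliberately uninformative; 'leak' is a copy of the label
df = pd.DataFrame({'feature': [5, 5, 3, 3, 8, 8, 1, 1],
                   'label':   [1, 0, 1, 0, 1, 0, 1, 0]})
df['leak'] = df['label']

x = df[['feature', 'leak']]   # target accidentally included as a feature
y = df['label']

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
model = DecisionTreeRegressor()
model.fit(train_x, train_y)
print(accuracy_score(val_y, model.predict(val_x).round()))  # 1.0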