train_test_split() is intended to take your dataset and split it into two chunks, the training and testing sets. In your case, you already have the data split into two chunks, in separate csv files. You are then taking the train data and splitting it again into train and val, which is short for validation (essentially the test or verification data).
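If it helps to see that call in isolation, here's a minimal, self-contained sketch using toy data (nothing to do with the Titanic columns):

import pandas as pd
from sklearn.model_selection import train_test_split

# a small frame just to show the mechanics of the split
df = pd.DataFrame({'feature': range(8), 'label': [0, 1] * 4})
x = df[['feature']]
y = df.label

# by default 25% of the rows are held out as the second chunk
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
print(len(train_x), len(val_x))  # 6 2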
You probably want to do the model.fit against your full training data set, and then call model.predict against the test set. There shouldn't be a need for the call to train_test_split().
Edit:
I may be wrong here. Looking at the competition page, I realize that the test set does not include the ground-truth values, so you can't use that data to validate your model's accuracy. In that case, I think splitting the original training dataset into training and validation sets makes sense. Since you're fitting the model only on the train portion, the validation set is still unseen by the model, and you can then use the known values from the validation set to verify your model's predictions.
The test set would just be used to generate 'new' predictions, since you don't have the ground-truth values to check them against.
Edit (in response to comment):
I don't have these data sets and haven't actually run this code, but I'd suggest something like the following. Essentially you want to do the same preparation on your test data as you are doing on the training data, and then feed it into your model the same way the validation set was fed in.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score
def get_dataset(path):
    # read the csv, encode Sex as a numeric code, and drop rows with missing values
    data = pd.read_csv(path)
    data['Sex'] = pd.factorize(data.Sex)[0]
    filtered_titanic_data = data.dropna(axis=0)
    return filtered_titanic_data
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
test_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\test.csv"
train_data = get_dataset(train_path)
test_data = get_dataset(test_path)
columns_of_interest = ['Pclass', 'Sex', 'Age']
x = train_data[columns_of_interest]
y = train_data.Survived
# hold back part of the labelled data so the model can be checked on rows it hasn't seen
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(val_predictions)
# round the regressor's float outputs to 0/1 before scoring them as class labels
print(accuracy_score(val_y, val_predictions.round()))
test_x = test_data[columns_of_interest]
test_predictions = titanic_model.predict(test_x)
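Since DecisionTreeRegressor returns floats rather than 0/1 labels, if you want to eyeball those test predictions you could do something like this (purely illustrative; it assumes the PassengerId column is present, which it is in the Kaggle Titanic csvs):

# pair the rounded predictions with the passenger ids for inspection
results = pd.DataFrame({'PassengerId': test_data.PassengerId,
                        'Survived': test_predictions.round().astype(int)})
print(results.head())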
(Also, note that I removed the Survived column from columns_of_interest. I believe by including that column in your x data, you were giving the model the value that it was attempting to predict, which is likely why you were getting 1.0 for the validation as well. You're giving it the answers to the test.)
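If you want to see that effect on its own, here's a tiny self-contained demo of the leakage (toy data again, not the Titanic set): the 'leak' column is an exact copy of the label, so the tree just reads it back and the validation accuracy comes out as a perfect 1.0.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score

# 'feature' is deliberately uninformative; 'leak' is a copy of the label
df = pd.DataFrame({'feature': [5, 5, 3, 3, 8, 8, 1, 1],
                   'label':   [1, 0, 1, 0, 1, 0, 1, 0]})
df['leak'] = df['label']

x = df[['feature', 'leak']]   # target accidentally included as a feature
y = df['label']

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
model = DecisionTreeRegressor()
model.fit(train_x, train_y)
print(accuracy_score(val_y, model.predict(val_x).round()))  # 1.0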