I am new to machine learning and am getting started with the Titanic problem on Kaggle. I have written a simple algorithm to predict the result on the test data.

My question: every time I run the algorithm with the same dataset and the same steps, the score changes (last statement in the code). I don't understand this behaviour.

Code:

# imports
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier

# load data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
results = pd.read_csv('gender_submission-orig.csv')

# prepare training and test dataset
y = train['Survived']
X = train.drop(['Survived', 'SibSp', 'Ticket', 'Cabin', 'Embarked', 'Name'], axis=1)
test = test.drop(['SibSp', 'Ticket', 'Cabin', 'Embarked', 'Name'], axis=1)
y_test = results['Survived']

X = pd.get_dummies(X)
test = pd.get_dummies(test)

# fill the missing values
age_median = X['Age'].median()
fare_median = X['Fare'].median()

X['Age'] = X['Age'].fillna(age_median)
test['Age'] = test['Age'].fillna(age_median)
test['Fare'] = test['Fare'].fillna(fare_median)

# train the classifier and predict
clf = DecisionTreeClassifier()
clf.fit(X, y)
predict = clf.predict(test)

# This is the score which changes with execution.
print(round(clf.score(test, y_test) * 100, 2)) 
desertnaut
YoungHobbit
  • Try DecisionTreeClassifier(random_state=42) – Dani Mesejo Dec 25 '18 at 13:51
  • Thanks @DanielMesejo, it worked. – YoungHobbit Dec 25 '18 at 13:54
  • But with the different values, the score also changes. So how do we find the optimal or right value? – YoungHobbit Dec 25 '18 at 13:57
  • There is an *inherent* randomness in these algorithms, beyond which you simply cannot go. Setting the random seed just ensures reproducibility of a specific model/script, but finding any "optimal" value in the sense you mean it here (i.e. regarding the random parts) is not possible. – desertnaut Dec 25 '18 at 14:01

1 Answer

This is a common frustration for newcomers to the field. The cause is the randomness built into this kind of algorithm: scikit-learn's DecisionTreeClassifier randomly permutes the features at each split, so when several candidate splits are equally good, the one chosen depends on the state of the random number generator. The simple and straightforward remedy, as already suggested in the comments, is to explicitly set that state (seed), e.g.:

clf = DecisionTreeClassifier(random_state=42) 
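To see the effect in isolation, here is a minimal sketch that fits the same tree twice with the same seed and checks that the predictions match. The synthetic dataset from `make_classification`, the split, and the helper name `fit_and_predict` are assumptions for illustration only; they stand in for the Titanic data from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)


def fit_and_predict(seed):
    """Hypothetical helper: fit a tree with the given seed, predict on held-out data."""
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_tr, y_tr)
    return clf.predict(X_te)


first_run = fit_and_predict(42)
second_run = fit_and_predict(42)

# With a fixed seed, two independent fits give the same predictions.
print(np.array_equal(first_run, second_run))
```

Leaving `random_state` unset lets the estimator draw from NumPy's global random state instead, which is why repeated runs of the original script can disagree.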

> But with the different values, the score also changes. So how do we find the optimal or right value?

Again, this is expected and cannot be avoided: this randomness is fundamental to the algorithm, and you simply cannot go beyond it. Setting the random seed as suggested above only ensures reproducibility of a specific model/script; finding an "optimal" value in the sense you mean here (i.e. regarding the random parts) is not possible. The results produced by different seeds should be statistically similar, but quantifying that similarity exactly is an exercise in rigorous statistics that goes well beyond the scope of this post.
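Short of a rigorous analysis, one informal way to get a feel for this is to fit the model with several seeds and look at the spread of the scores. The sketch below does that on a synthetic dataset; the data, the number of seeds, and the variable names are assumptions for illustration, not part of the original question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic data (an assumption for illustration).
X_all, y_all = make_classification(n_samples=500, n_features=10, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_all, y_all, random_state=0)

# Fit one tree per seed and record its accuracy on the held-out part.
seed_scores = []
for seed in range(10):
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_fit, y_fit)
    seed_scores.append(clf.score(X_hold, y_hold))

seed_scores = np.array(seed_scores)
print(f"mean accuracy: {seed_scores.mean():.3f}")
print(f"std deviation: {seed_scores.std():.3f}")
```

Reporting the mean together with the spread is usually more informative than the score from any single seed, and it avoids the temptation to pick the seed that happens to score best on the test data.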

Randomness is often a non-intuitive realm, and random number generators (RNGs) themselves are strange animals... As a general note, you might be interested to know that RNGs are not even "compatible" across different languages & frameworks: the same seed generally yields different sequences in each.
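A small illustration of that last point, staying within Python: the standard library's `random` module and NumPy's generators, all seeded with the same integer, return different first values. This is only a sketch of the general idea; the choice of these three generators and the seed value is an assumption made for the example.

```python
import random

import numpy as np

SEED = 42

# Python's standard library generator.
random.seed(SEED)
py_value = random.random()

# NumPy's legacy generator.
legacy_rng = np.random.RandomState(SEED)
np_legacy_value = legacy_rng.random_sample()

# NumPy's newer Generator API.
new_rng = np.random.default_rng(SEED)
np_new_value = new_rng.random()

# Same seed, three different first draws.
print(py_value, np_legacy_value, np_new_value)
```

A seed is therefore only meaningful together with the specific generator that consumes it, so a result reproduced with `random_state=42` in one library says nothing about seed 42 elsewhere.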

desertnaut