
I am training an XGBClassifier as follows:

import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, VarianceThreshold

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, stratify=y_train, test_size=0.1, random_state=42)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

params = {
    'xgb': {
        'clf__n_estimators': [100, 200, 400],
        'clf__max_depth': [3, 5, 7],
        'clf__learning_rate': [0.01, 0.1],
        'clf__min_child_weight': [3, 10],
    }
}
model = XGBClassifier(objective='binary:logistic', n_jobs=15, use_label_encoder=False, random_state=42)
sel = SelectKBest(k='all')
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0))])


preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols)])

pipe = Pipeline(steps=[('preprocessor', preprocessor), ('var', VarianceThreshold()), ('sel', sel), ('clf', model)])

from HyperclassifierSearch import HyperclassifierSearch
search = HyperclassifierSearch({'xgb': pipe}, params)  # model dict keys match the keys in params
best_model = search.train_model(X_train, y_train, cv=cv, scoring='accuracy')

Each time I run the above, I get a different set of best params from the grid search. How is this happening when I set random_state in each of the following:

  1. the train/test splits
  2. my cross-validation function
  3. my XGBClassifier model

For SelectKBest I also use 'all' features for now, so feature selection should not introduce any randomness. But I am confused as to how I got different results on each run. FYI, HyperclassifierSearch is just a wrapper around GridSearchCV.
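
Since the wrapper just delegates to GridSearchCV, one way I could narrow this down (a minimal sketch, reusing the pipe, params, cv and training data defined above) is to bypass the wrapper, run the identical search twice single-threaded, and compare the winners:

from sklearn.model_selection import GridSearchCV

best = []
for run in range(2):
    # n_jobs=1 removes sklearn-level parallelism as a variable
    gs = GridSearchCV(pipe, params['xgb'], cv=cv, scoring='accuracy', n_jobs=1)
    gs.fit(X_train, y_train)
    best.append(gs.best_params_)

# True -> the search itself is deterministic and the wrapper/threading is suspect;
# False -> the nondeterminism is inside the estimator/CV itself
print(best[0] == best[1])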

Any ideas on why this could be happening based on the above? Is it perhaps the tree_method in XGBClassifier?
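
To test that hypothesis, one option (again just a sketch, not a confirmed fix) would be to pin the classifier to a single thread and the exact tree method, which does greedy split finding deterministically, and see whether the best params stop changing between runs:

model = XGBClassifier(
    objective='binary:logistic',
    use_label_encoder=False,
    random_state=42,
    n_jobs=1,             # single thread, rules out nondeterministic parallel reductions
    tree_method='exact',  # deterministic exact split enumeration instead of 'hist'/'approx'
)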

