
I noticed that there are two implementations of XGBoost in Python (the native API and the sklearn wrapper), as discussed here and here.

When I ran the same dataset through both implementations, I noticed that the results were different.

Code

import xgboost
from xgboost.sklearn import XGBRegressor
import pandas as pd
import numpy as np
from sklearn import datasets

# Load the Boston housing dataset into a DataFrame
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)

Y = df['target']
X = df.drop('target', axis=1)

#### Code using the native implementation of XGBoost
dtrain = xgboost.DMatrix(X, label=Y, missing=0.0)
params = {'max_depth': 3, 'learning_rate': 0.05, 'min_child_weight': 4, 'subsample': 0.8}

model = xgboost.train(params=params, dtrain=dtrain, num_boost_round=200)
predictions = model.predict(dtrain)

#### Code using the sklearn wrapper for XGBoost
model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05,
                     min_child_weight=4, subsample=0.8)

# model = model.fit(X, Y, eval_set=[(X, Y), (X, Y)], eval_metric='rmse', verbose=True)
model = model.fit(X, Y)
predictions2 = model.predict(X)

# Total absolute difference between the two sets of predictions
print(np.absolute(predictions - predictions2).sum())

Absolute difference sum using the sklearn Boston dataset:

62.687134

When I ran the same comparison on other datasets, such as the sklearn diabetes dataset, I observed that the difference was much smaller.

Absolute difference sum using the sklearn diabetes dataset:

0.0011711121
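
The diabetes run differs only in how the data is loaded; a minimal sketch of the substitution (everything downstream of the loading step stays the same):

from sklearn import datasets
import pandas as pd

# Sketch: swap in the diabetes dataset; the rest of the comparison is unchanged.
diabetes_data = datasets.load_diabetes()
df = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
df['target'] = pd.Series(diabetes_data.target)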
Allen
  • Another observation: when I train with a single sparse feature containing both negative and positive values, the predictions don't seem to match. – Allen Dec 20 '19 at 07:00

2 Answers


Make sure the random seeds are the same.

For both approaches, set the same seed:

param['seed'] = 123

EDIT: there are a couple of other things to check. First, is n_estimators also 200? Are you also imputing missing values with 0 in the second implementation? And are the other default values the same? (For that last one I think yes, because it is a wrapper, but check the other two things.)
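
For instance, a minimal sketch of pinning the seed in both APIs, reusing the X, Y, and dtrain objects from the question (parameter names assume a reasonably recent xgboost release, where the wrapper exposes the seed as random_state):

import xgboost
from xgboost.sklearn import XGBRegressor

# Sketch: set the same seed in both APIs before comparing predictions.
params = {'max_depth': 3, 'learning_rate': 0.05, 'min_child_weight': 4,
          'subsample': 0.8, 'seed': 123}          # native API: 'seed' in params
native_model = xgboost.train(params=params, dtrain=dtrain, num_boost_round=200)

sk_model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05,
                        min_child_weight=4, subsample=0.8,
                        random_state=123)         # sklearn wrapper: random_state
sk_model.fit(X, Y)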

Noah Weber
  • I tried setting the same seed in both cases and found that there was still a non-zero difference. The number of estimators is also 200 in both cases, and I tested with a dataset with no missing values and still observed a difference. – Allen Dec 19 '19 at 06:16

I had not set the "missing" parameter for the sklearn implementation. Once that was set, the values matched.

Also, as Noah pointed out, the sklearn wrapper has a few different default values, which need to be aligned in order to reproduce the results exactly.
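
For reference, a sketch of the aligned wrapper call, reusing X and Y from the question; missing=0.0 mirrors the xgboost.DMatrix(X, label=Y, missing=0.0) call above, and any remaining defaults (for example base_score) are version-dependent and worth checking:

from xgboost.sklearn import XGBRegressor

# Sketch: align the wrapper with the native DMatrix(missing=0.0) run.
model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05,
                     min_child_weight=4, subsample=0.8,
                     missing=0.0)    # matches xgboost.DMatrix(X, label=Y, missing=0.0)
model = model.fit(X, Y)
predictions2 = model.predict(X)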

Allen