
I noticed that there are two possible implementations of XGBoost in Python, as discussed here.

When I ran the same dataset through both implementations, the results were different.

Using the low-level API, `xgboost.train(...)`:

import xgboost

# Build the DMatrix, treating 0.0 as the missing-value marker
dtrain = xgboost.DMatrix(X, label=Y, missing=0.0)
param = {'max_depth': 3, 'objective': 'reg:squarederror', 'booster': 'gbtree'}
evallist = [(dtrain, 'eval'), (dtrain, 'train')]
num_round = 10
xgb_dMatrix = xgboost.train(param, dtrain, num_round, evallist)

Output

[0] eval-rmse:7115.31   train-rmse:7115.31
[1] eval-rmse:5335.37   train-rmse:5335.37
[2] eval-rmse:4054.77   train-rmse:4054.77
[3] eval-rmse:3140.91   train-rmse:3140.91
[4] eval-rmse:2510.33   train-rmse:2510.33
[5] eval-rmse:2080.62   train-rmse:2080.62
[6] eval-rmse:1785.53   train-rmse:1785.53
[7] eval-rmse:1571.92   train-rmse:1571.92
[8] eval-rmse:1399.57   train-rmse:1399.57
[9] eval-rmse:1301.64   train-rmse:1301.64

Using the scikit-learn wrapper, `xgboost.XGBRegressor(...)`:

# Same data through the scikit-learn interface, evaluated on the training set twice
xgb_reg = xgboost.XGBRegressor(max_depth=3, n_estimators=10)
xgb_reg.fit(X_train, Y_train,
            eval_set=[(X_train, Y_train), (X_train, Y_train)],
            eval_metric='rmse', verbose=True)

Output

[0] validation_0-rmse:8827.63   validation_1-rmse:8827.63
[1] validation_0-rmse:8048.16   validation_1-rmse:8048.16
[2] validation_0-rmse:7349.83   validation_1-rmse:7349.83
[3] validation_0-rmse:6720.69   validation_1-rmse:6720.69
[4] validation_0-rmse:6154.82   validation_1-rmse:6154.82
[5] validation_0-rmse:5637.49   validation_1-rmse:5637.49
[6] validation_0-rmse:5173.9    validation_1-rmse:5173.9
[7] validation_0-rmse:4759.14   validation_1-rmse:4759.14
[8] validation_0-rmse:4386.29   validation_1-rmse:4386.29
[9] validation_0-rmse:4051.63   validation_1-rmse:4051.63

I thought the parameters were the cause of the difference, so I fetched the parameters from the scikit-learn wrapper and passed them to the low-level API, but the results were still different. Code for fetching the parameters:

xgb_reg.get_params()
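
To make the comparison concrete, here is a minimal sketch of how I passed the fetched parameters across (dropping the `None` placeholders and mapping `n_estimators` to `num_boost_round` are my assumptions about how the two interfaces line up):

sk_params = xgb_reg.get_params()
# n_estimators has no slot in the params dict; it corresponds to num_boost_round
num_round_matched = sk_params.pop('n_estimators')
# Keep only the parameters that are actually set; some wrapper defaults are None,
# which the low-level API does not expect (my assumption)
param_matched = {k: v for k, v in sk_params.items() if v is not None}
xgb_matched = xgboost.train(param_matched, dtrain, num_round_matched, evallist)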

What could be the reason the results don't match between the two interfaces, which are supposedly similar internally?

– Allen
  • Please check the duplicate question marked above. There is a difference there that `xgb_reg.get_params()` cannot handle. Please let me know if you have already tried that and the results still don't match, and update your code with [a reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) so that I can reopen this. – Vivek Kumar Dec 03 '19 at 12:34
  • The default parameters are different. For example, `learning_rate` is 0.1 in the scikit-learn wrapper and 0.3 in the low-level API. The same applies to other parameters. Try fixing them to the same values. – user2874583 Dec 03 '19 at 15:51
  • Thanks, once I matched the learning rate and max depth I was able to match the results for the small dataset (see the sketch after these comments). But I'm still having trouble matching the results for larger datasets like the Boston dataset [here](https://stackoverflow.com/questions/59395651/difference-is-value-between-xgb-train-and-xgb-xgbregressor-in-python-for-certain). – Allen Dec 18 '19 at 15:53
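
For completeness, a minimal sketch of what pinning the differing defaults looks like on both sides (the 0.1 value comes from the comment above; treating `eta` and `learning_rate` as aliases for the same parameter is my assumption):

# Low-level API with the wrapper's default learning rate pinned explicitly
param = {'max_depth': 3, 'eta': 0.1, 'objective': 'reg:squarederror', 'booster': 'gbtree'}
xgb_pinned = xgboost.train(param, dtrain, 10, [(dtrain, 'train')])

# Scikit-learn wrapper with the same values pinned explicitly.
# Note the wrapper treats NaN as missing by default, while dtrain above
# was built with missing=0.0 -- another place the two runs can diverge.
xgb_reg = xgboost.XGBRegressor(max_depth=3, n_estimators=10, learning_rate=0.1,
                               objective='reg:squarederror', booster='gbtree')
xgb_reg.fit(X, Y, eval_set=[(X, Y)], eval_metric='rmse', verbose=True)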
