
I have this problem with xgboost I use at work. My task is to port a piece of code that's currently running in R to python.

What the code does: my aim is to use XGBoost to determine the features with the highest gain. I made sure the inputs into XGBoost are identical in R and Python. XGBoost is run roughly 100 times (on different data) and each time I extract the 30 best features by gain.

My problem is this: the inputs in R and Python are identical, yet the two produce vastly different features (both in the total number of features per round and in which features are chosen). They only share about 50% of features. My parameters are the same, and I don't use any sampling, so there is no randomness.

Another thing I noticed: XGBoost is slower in Python than in R with the same parameters. Is this a known issue?

R parameters

Python parameters

I've been looking around but haven't found anyone with a similar problem. I can't share the data or code because it's confidential. Does anyone have an idea why the features differ so much?

R version: 3.4.3

XGBoost R version: 0.6.4.1

python version: 3.6.5

XGBoost python version: 0.71

Running on Windows.

Johny
    As someone who had to do this earlier this year, all I thought was that the R xgboost modeling was significantly better than python – Adam Warner Jun 06 '18 at 15:05
  • @AdamWarner Have you found out why by any chance? And have you tried both xgboost.train and the sklearn XGBClassifier or XGBRegressor? Was there a significant difference? – Johny Jun 06 '18 at 16:30

1 Answer


You set the internal seed in the R code but not in the Python code.

More of an issue is likely that Python and R may use different random number generators, so even with both internal and external seeds set you could get different sequences. This thread may help in that respect.

I would also hazard a guess that the variables not selected in one model carry similar information to those selected in the other, so swapping variables one way or the other shouldn't impact model performance significantly. That said, I don't know whether the R and Python models actually perform the same.
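To put a number on the "about 50% shared" observation, one simple way to quantify round-by-round agreement between the two top-30 selections is a Jaccard overlap (the helper name here is my own):

```python
def overlap(selected_r, selected_py):
    """Jaccard overlap of two feature selections: 0 = disjoint, 1 = identical."""
    a, b = set(selected_r), set(selected_py)
    return len(a & b) / len(a | b)

# e.g. compute overlap(top30_r, top30_py) for each of the ~100 rounds
# and look at the distribution rather than a single run.
```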

Community
  • To clarify, I used both xgboost.train and XGBClassifier in Python. xgboost.train doesn't have a seed parameter (it gave me an identical result after running it twice, though, so I guess it does have a default value). However, in XGBClassifier I did set it to 0 and it performed even worse than xgboost.train. I use XGBoost only to extract the best predictors, and the filtered dataset is then going to be fed into a neural network. That part isn't done yet, so it remains to be seen whether the performance differs and, if so, by how much and in whose favour. Anyway, thanks for the answer! – Johny Jun 06 '18 at 16:44