
I am new to XGBoost in Python, so I apologize if the answer here is obvious, but I am trying to take a pandas DataFrame and get XGBoost in Python to give me the same predictions I get when I use the scikit-learn wrapper for the same exercise. So far I've been unable to do so. To give an example, here I take the Boston dataset, convert it to a pandas DataFrame, train on the first 500 observations, and then predict the last 6. I do it with XGBoost first and then with the scikit-learn wrapper, and I get different predictions even though I've set the parameters of the model to be the same. Specifically, the array `predictions` looks very different from the array `predictions2` (see code below). Any help would be much appreciated!

from sklearn import datasets
import pandas as pd
import xgboost as xgb
from xgboost.sklearn import XGBRegressor

### Use the boston data as an example, train on first 500, predict last 6 
boston_data = datasets.load_boston()
df_boston = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df_boston['target'] = pd.Series(boston_data.target)


#### Code using the native XGBoost API
Sub_train = df_boston.head(500)
target = Sub_train["target"]
Sub_train = Sub_train.drop('target', axis=1) 

Sub_predict = df_boston.tail(6)
Sub_predict = Sub_predict.drop('target', axis=1)  

xgtrain = xgb.DMatrix(Sub_train.values, label=target.values)  # .values replaces the deprecated as_matrix()
xgtest = xgb.DMatrix(Sub_predict.values)

params = {'booster': 'gblinear', 'objective': 'reg:linear',
          'max_depth': 2, 'learning_rate': .1, 'n_estimators': 500,
          'min_child_weight': 3, 'colsample_bytree': .7,
          'subsample': .8, 'gamma': 0, 'reg_alpha': 1}

model = xgb.train(dtrain=xgtrain, params=params)

predictions = model.predict(xgtest)

#### Code using the scikit-learn wrapper for XGBoost
model = XGBRegressor(learning_rate=.1, n_estimators=500,
                     max_depth=2, min_child_weight=3, gamma=0,
                     subsample=.8, colsample_bytree=.7, reg_alpha=1,
                     objective='reg:linear')

target = "target"

Sub_train = df_boston.head(500)
Sub_predict = df_boston.tail(6)
Sub_predict = Sub_predict.drop('target', axis=1)

Ex_List = ['target']

predictors = [i for i in Sub_train.columns if i not in Ex_List]

model = model.fit(Sub_train[predictors],Sub_train[target])

predictions2 = model.predict(Sub_predict)
Joseph E

1 Answer


Please look at this answer here:

xgboost.train will ignore the parameter n_estimators, while xgboost.XGBRegressor accepts it. In xgboost.train, the number of boosting iterations (i.e. n_estimators) is controlled by num_boost_round (default: 10).

It suggests removing n_estimators from the params supplied to xgb.train and replacing it with num_boost_round.

So change your params like this:

params = {'objective': 'reg:linear',
          'max_depth': 2, 'learning_rate': .1,
          'min_child_weight': 3, 'colsample_bytree': .7,
          'subsample': .8, 'gamma': 0, 'alpha': 1}

And call xgb.train like this:

model = xgb.train(dtrain=xgtrain, params=params, num_boost_round=500)

And you will get the same results.
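A quick way to confirm the match, assuming the `predictions` and `predictions2` arrays from the code above are both in scope (the tolerance here is my own choice):

import numpy as np

# With the boosting rounds aligned, the native Booster and the
# sklearn wrapper should produce (near-)identical predictions.
print(np.allclose(predictions, predictions2, atol=1e-5))  # expected: True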

Alternatively, keep the xgb.train call as it is and change the XGBRegressor like this:

model = XGBRegressor(learning_rate=.1, n_estimators=10,
                     max_depth=2, min_child_weight=3, gamma=0,
                     subsample=.8, colsample_bytree=.7, reg_alpha=1,
                     objective='reg:linear')

Then, too, you will get the same results.
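As a rough sanity check that both routes performed the same number of boosting rounds, you can count the dumped trees. This is only a sketch under a few assumptions: the default gbtree booster (no 'booster': 'gblinear' in params), an xgboost version whose sklearn wrapper exposes get_booster(), and my own placeholder names model_native / model_sklearn for the xgb.train result and the fitted XGBRegressor:

# With gbtree, Booster.get_dump() returns one string per tree,
# so its length equals the number of boosting rounds performed.
n_rounds_native = len(model_native.get_dump())
n_rounds_sklearn = len(model_sklearn.get_booster().get_dump())
print(n_rounds_native, n_rounds_sklearn)  # expected: 10 and 10 in this variant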

Vivek Kumar
  • Thank you so much! I reran the code with the first set of changes you suggested and now the arrays (predictions and predictions2) match perfectly. – Joseph E Oct 26 '17 at 06:36
  • @VivekKumar: Please, how do you reproduce num_boost_round=500 from xgb.train in XGBRegressor? Similarly, how do you specify in xgb.train whether you are training a classifier or a regressor? The objective parameter might do that, but since you can specify a value such as binary:logistic in XGBRegressor too, I guess this is not the parameter used for this. Thanks. – Ando Jurai Jun 01 '18 at 13:46
  • 1
  • @AndoJurai You need to use n_estimators=500 for that. But I don't follow the second part of your question. Please explain in more detail what you want. – Vivek Kumar Jun 01 '18 at 13:50
  • Thanks. Well, I want to know which parameter to set in xgb.train so that it knows whether it is training a classifier or a regressor. Does it infer this from the labels? I ask because if there are XGBRegressor and XGBClassifier wrappers, there must be some important difference between them. So how does it work at the xgb.train level? – Ando Jurai Jun 01 '18 at 14:04
  • 1
  • @AndoJurai Check here: http://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters. If you're still not satisfied, start a new question with the relevant info. – Vivek Kumar Jun 01 '18 at 14:11
  • Well, actually there is nothing about this on the linked page. I also read it but still don't understand, as you seem to be able to pass either "reg:logistic" or "binary:logistic" to an XGBRegressor/Classifier, so it is not clear how train would distinguish between the two. – Ando Jurai Jun 01 '18 at 14:18
  • 1
  • @AndoJurai Did you look at the objective parameter on the linked page? All the options there are valid; `"reg:linear"` is for regression. See the heading "Learning Task Parameters" (and the sketch after this thread). – Vivek Kumar Jun 01 '18 at 14:27
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/172252/discussion-between-ando-jurai-and-vivek-kumar). – Ando Jurai Jun 01 '18 at 14:42
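To illustrate the point from the comments above: xgb.train has no separate classifier or regressor entry point; the objective parameter alone selects the learning task. A minimal sketch with made-up toy data (parameter values are my own):

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)  # binary labels
dtrain = xgb.DMatrix(X, label=y)

# 'binary:logistic' makes the trained Booster act as a classifier:
# predict() returns probabilities in [0, 1].
clf = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=10)

# 'reg:linear' (renamed reg:squarederror in later versions) makes it
# a regressor: predict() returns raw continuous values.
reg = xgb.train({'objective': 'reg:linear'}, dtrain, num_boost_round=10)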