I want to use xgboost in python for my upcoming model. However since our production system is in SAS, I am trying to extract decision rules from xgboost and then write a SAS scoring code to implement this model in SAS environment.
I have gone through multiple links to this. below are some of them:
How to extract decision rules (features splits) from xgboost model in python3?
The above two links are of great help specifically the code given by Shiutang-Li for xgboost deployment. However, my predicted scores are not exactly matching.
below is the code I have tried so far:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.grid_search import GridSearchCV
%matplotlib inline
import graphviz
from graphviz import Digraph
#Read the sample iris data:
iris =pd.read_csv("C:\\Users\\XXXX\\Downloads\\Iris.csv")
#Create dependent variable:
iris.loc[iris["class"] != 2,"class"] = 0
iris.loc[iris["class"] == 2,"class"] = 1
#Select independent and dependent variable:
X = iris[["sepal_length","sepal_width","petal_length","petal_width"]]
Y = iris["class"]
xgdmat = xgb.DMatrix(X, Y) # Create our DMatrix to make XGBoost more efficient
#Build the sample xgboost Model:
our_params = {'eta': 0.1, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8,
'objective': 'binary:logistic', 'max_depth':3, 'min_child_weight':1}
Base_Model = xgb.train(our_params, xgdmat, num_boost_round = 10)
#Below code reads the dump file created by xgboost and writes a scoring code in SAS:
import re
def string_parser(s):
if len(re.findall(r":leaf=", s)) == 0:
out = re.findall(r"[\w.-]+", s)
tabs = re.findall(r"[\t]+", s)
if (out[4] == out[8]):
missing_value_handling = (" or missing(" + out[1] + ")")
else:
missing_value_handling = ""
if len(tabs) > 0:
return (re.findall(r"[\t]+", s)[0].replace('\t', ' ') +
' if state = ' + out[0] + ' then do;\n' +
re.findall(r"[\t]+", s)[0].replace('\t', ' ') +
' if ' + out[1] + ' < ' + out[2] + missing_value_handling +
' then state = ' + out[4] + ';' + ' else state = ' + out[6] + ';\nend;' )
else:
return (' if state = ' + out[0] + ' then do;\n' +
' if ' + out[1] + ' < ' + out[2] + missing_value_handling +
' then state = ' + out[4] + ';' + ' else state = ' + out[6] + ';\nend;' )
else:
out = re.findall(r"[\w.-]+", s)
return (re.findall(r"[\t]+", s)[0].replace('\t', ' ') +
' if state = ' + out[0] + ' then\n ' +
re.findall(r"[\t]+", s)[0].replace('\t', ' ') +
' value = value + (' + out[2] + ') ;\n')
def tree_parser(tree, i):
return ('state = 0;\n'
+ "".join([string_parser(tree.split('\n')[i]) for i in range(len(tree.split('\n'))-1)]))
def model_to_sas(model, out_file):
trees = model.get_dump()
result = ["value = 0;\n"]
with open(out_file, 'w') as the_file:
for i in range(len(trees)):
result.append(tree_parser(trees[i], i))
the_file.write("".join(result))
the_file.write("\nY_Pred1 = 1/(1+exp(-value));\n")
the_file.write("Y_Pred0 = 1 - Y_pred1;")
Call the above above module to create a SAS scoring code:
model_to_sas(Base_Model, 'xgb_scr_code.sas')
Unfortunately I can't provide full SAS code generated by the above module. However, please find below the SAS code if we build the model using only one tree code:
value = 0;
state = 0;
if state = 0 then
do;
if sepal_width < 2.95000005 or missing(sepal_width) then state = 1;
else state = 2;
end;
if state = 1 then
do;
if petal_length < 4.75 or missing(petal_length) then state = 3;
else state = 4;
end;
if state = 3 then value = value + (0.1586207);
if state = 4 then value = value + (-0.127272725);
if state = 2 then
do;
if petal_length < 3 or missing(petal_length) then state = 5;
else state = 6;
end;
if state = 5 then value = value + (-0.180952385);
if state = 6 then
do;
if petal_length < 4.75 or missing(petal_length) then state = 7;
else state = 8;
end;
if state = 7 then value = value + (0.142857149);
if state = 8 then value = value + (-0.161290333);
Y_Pred1 = 1/(1+exp(-value));
Y_Pred0 = 1 - Y_pred1;
Below is the dump file output for 1st tree:
booster[0]:
0:[sepal_width<2.95000005] yes=1,no=2,missing=1
1:[petal_length<4.75] yes=3,no=4,missing=3
3:leaf=0.1586207
4:leaf=-0.127272725
2:[petal_length<3] yes=5,no=6,missing=5
5:leaf=-0.180952385
6:[petal_length<4.75] yes=7,no=8,missing=7
7:leaf=0.142857149
8:leaf=-0.161290333
So basically, what I am trying to do is that, save the node number in the variable "state" and accordingly access the leaf nodes (which I learned from the article by Shiutang-Li mentioned in the above links).
Here is the issue I am facing:
For up to approx 40 trees, the predicted score is exactly matching. For example please see below:
Case 1:
Predicted value using python for 10 trees:
Y_pred1 = Base_Model.predict(xgdmat)
print("Development- Y_Actual: ",np.mean(Y)," Y predicted: ",np.mean(Y_pred1))
Output:
Average- Y_Actual: 0.3333333333333333 Average Y predicted: 0.4021197
Predicted value using SAS for 10 trees:
Average Y predicted: 0.4021197
Case 2:
Predicted value using python for 100 trees:
Y_pred1 = Base_Model.predict(xgdmat)
print("Development- Y_Actual: ",np.mean(Y)," Y predicted: ",np.mean(Y_pred1))
Output:
Average- Y_Actual: 0.3333333333333333 Average Y predicted: 0.33232176
Predicted value using SAS for 100 trees:
Average Y predicted: 0.3323159
As you can see the scores are not exactly matching(matches up to 4 decimal points) for 100 trees. Also, I have tried this on large files where the difference in scores is quite high i.e more than 10% deviation in scores.
Could anyone let me point towards any error in my code, so that scores can match exactly. Below is my some queries:
1)Is my score calculation correct.
2)I found something related to the gamma(regularization term). Does it impact the way xgboost calculates the scores using leaf values.
3)Does the leaf values given by dump files will have any rounding off, thus creating this issue
Also, I would appreciate any other method to do this task apart from parsing the dump file.
P.S.: I only have SAS EG and do not have access to SAS EM or SAS IML.