
Python: 3.6

Windows: 10

I have a few questions regarding Random Forest and the problem at hand:

I am using a randomized search to tune a Random Forest on a regression problem. I want to plot the tree corresponding to the best-fit parameters that the search found. Here is the code.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV, train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

    # Use the random grid to search for the best hyperparameters
    # First create the base model to tune
    rf = RandomForestRegressor()
    # Randomized search over the parameters, using 5-fold cross-validation,
    # trying 50 different combinations, and using all available cores
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                                   n_iter=50, cv=5, verbose=2, random_state=56, n_jobs=-1)
    # Fit the randomized search model
    rf_random.fit(X_train, y_train)

    rf_random.best_params_

The best parameters came out to be:

    {'n_estimators': 1000,
     'min_samples_split': 5,
     'min_samples_leaf': 1,
     'max_features': 'auto',
     'max_depth': 5,
     'bootstrap': True}
  1. How can I plot the tree corresponding to the above parameters?

  2. My dependent variable y lies in the range [0, 1] (continuous) and all predictor variables are either binary or categorical. Which algorithms generally work well for this input and output feature space? I tried Random Forest, but it didn't give very good results. Note that y is a kind of ratio, which is why it lies between 0 and 1. Example: expense on food / total expense.

  3. The above data is skewed: the dependent variable y equals 1 in 60% of the data and lies somewhere between 0 and 1 in the rest, e.g. 0.66, 0.87, and so on.

  4. Since my data has only binary {0,1} and categorical {A,B,C} variables, do I need to convert them into one-hot encoded variables to use Random Forest?

desertnaut
MAC

2 Answers


Allow me to take a step back before answering your questions.

Ideally, one should drill down further on the best_params_ output of RandomizedSearchCV using GridSearchCV. RandomizedSearchCV samples from your parameter distributions without trying all possible combinations. Once you have the best_params_ from RandomizedSearchCV, you can exhaustively investigate all the options across a narrower range.

You did not include your random_grid parameters in the code, but I would expect you to do a GridSearchCV like this:

    # Create the parameter grid based on the results of RandomizedSearchCV
    param_grid = {
        'max_depth': [4, 5, 6],
        'min_samples_leaf': [1, 2],
        'min_samples_split': [4, 5, 6],
        'n_estimators': [990, 1000, 1010]
    }
    # Note: GridSearchCV takes no random_state; the search is exhaustive
    grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                               cv=5, n_jobs=-1, verbose=2)
    # Fit the grid search model
    grid_search.fit(X_train, y_train)

The above goes through every combination of the parameters in param_grid and gives you the best one.
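As a minimal sketch of this refine-and-fit round trip (on synthetic data, with a deliberately tiny grid so it runs quickly; the sizes and values are illustrative, not the poster's):

```python
# Sketch: refine hyperparameters with GridSearchCV on synthetic regression data
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=100, n_features=5, random_state=0)

param_grid = {
    'max_depth': [4, 5],
    'n_estimators': [10, 20],  # kept tiny so the sketch runs fast
}
grid_search = GridSearchCV(estimator=RandomForestRegressor(random_state=0),
                           param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X, y)

print(grid_search.best_params_)   # the refined parameter combination
print(grid_search.best_score_)    # mean cross-validated R^2 for those parameters
```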

Now coming to your questions:

  1. Random forests are a combination of multiple trees, so there is no single tree that you can plot. What you can do instead is plot one or more of the individual trees used by the random forest. This can be achieved with the plot_tree function. Have a read of the documentation and this SO question to understand it better.

  2. Did you try a simple linear regression first?

  3. This affects which metrics you should use to assess your model's fit. Precision, recall & F1 scores come to mind when dealing with unbalanced/skewed data.

  4. Yes, categorical variables need to be converted to dummy variables before fitting a random forest
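On point 4, a minimal sketch of one-hot encoding with pandas.get_dummies (the column names here are made up for illustration):

```python
import pandas as pd

# Toy frame: one binary predictor and one categorical predictor
df = pd.DataFrame({'binary_var': [0, 1, 0],
                   'cat_var': ['A', 'B', 'C']})

# get_dummies leaves numeric columns alone and expands 'cat_var'
# into one indicator column per category
encoded = pd.get_dummies(df, columns=['cat_var'])
print(list(encoded.columns))
# ['binary_var', 'cat_var_A', 'cat_var_B', 'cat_var_C']
```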

finlytics-hub
  • What you have suggested above for plotting the tree works well with a random forest classifier, but it doesn't work with a regressor – MAC May 31 '20 at 19:40
  • @MAC As per scikit-learn's documentation, the plot_tree function can be used for both classifiers and regressors, although I must admit that I have never applied it to a regressor. – finlytics-hub Jun 01 '20 at 03:12
  • I have written: `grid = GridSearchCV(estimator=xgb, param_grid=params, scoring='neg_mean_squared_error', n_jobs=4, verbose=3)` and `grid.fit(X_train, y_train)`. Now how can I draw a tree based on the best estimator? – MAC Jun 02 '20 at 11:28
  • @MAC XGBoost and Random Forests are ensembles of multiple decision trees. There is no single tree that represents the best parameters. One can, however, draw a specific tree within a trained XGBoost model using `plot_tree(grid, num_trees=0)`. Replace 0 with the nth decision tree that you want to visualize. To find out the number of trees in your `grid` model, check its `n_estimators`. – finlytics-hub Jun 02 '20 at 11:44

Regarding the plot (I am afraid that your other questions are way too broad for SO, where the general idea is to avoid asking multiple questions at the same time):

Fitting your RandomizedSearchCV has resulted in rf_random.best_estimator_, which is itself a random forest with the parameters shown in your question (including 'n_estimators': 1000).

According to the docs, a fitted RandomForestRegressor includes an attribute:

estimators_ : list of DecisionTreeRegressor

The collection of fitted sub-estimators.

So, to plot any individual tree of your random forest, you should use either

    from sklearn import tree
    tree.plot_tree(rf_random.best_estimator_.estimators_[k])

or

    from sklearn import tree
    tree.export_graphviz(rf_random.best_estimator_.estimators_[k])

for the desired k in [0, 999] in your case ([0, n_estimators-1] in the general case).
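Putting it together, a self-contained sketch (with synthetic data and a deliberately small forest so it runs quickly; sizes are illustrative) that fits a RandomForestRegressor and saves a plot of its k-th tree:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch runs headless
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=0)
rf = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
rf.fit(X, y)

k = 0  # any index in [0, n_estimators - 1]
fig, ax = plt.subplots(figsize=(12, 8))
tree.plot_tree(rf.estimators_[k], filled=True, ax=ax)
fig.savefig('tree_0.png')
```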

desertnaut