8

I am working with the R programming language. I am trying to fit a Random Forest model on a very large dataset (over 100 million rows) with imbalanced classes (i.e. binary response variable ratio 95% to 5%). To do this, the R code I wrote:

  • Step 1: Creates a training set and a test set for the sake of this Stackoverflow question
  • Step 2: Uses sampling with replacement to create many random (smaller) subsets from the training set with a better distribution of the response variable (this is an attempt to increase the "true accuracy" of the model)
  • Step 3: Fits a Random Forest model to each of these random subsets and saves each model to the working directory (in case the computer crashes). Note - I am using the "ranger" package instead of the "randomForest" package because I read that the "ranger" package is faster.
  • Step 4: Combines all these models into a single model - and then makes predictions on the test set with this combined model

Below, I have included the R code for these steps:

Step 1: Create Data for Problem

# Step 1: Randomly create data and make initial training/test set:


library(dplyr)
library(ranger)

original_data = rbind( data_1 = data.frame( class = 1, height = rnorm(10000, 180,10), weight = rnorm(10000, 90,10), salary = rnorm(10000,50000,10000)),  data_2 = data.frame(class = 0, height = rnorm(100, 160,10), weight = rnorm(100, 100,10), salary = rnorm(100,40000,10000)) )

original_data$class = as.factor(original_data$class)
original_data$id = 1:nrow(original_data)

test_set=  rbind(original_data[ sample( which( original_data$class == "0" ) , replace = FALSE , 30 ) , ], original_data[ sample( which( original_data$class == "1" ) , replace = FALSE, 2000 ) , ])

train_set = anti_join(original_data, test_set)

Step 2: Create "Balanced" Random Subsets:

# Step 2: Create "Balanced" Random Subsets:

results <- list()
for (i in 1:100)
   
{
   iteration_i = i
   
    sample_i =  rbind(train_set[ sample( which( train_set$class == "0" ) , replace = TRUE , 50 ) , ], train_set[ sample( which( train_set$class == "1" ) , replace = TRUE, 60 ) , ])
   
    results_tmp = data.frame(iteration_i, sample_i)
    results_tmp$iteration_i = as.factor(results_tmp$iteration_i)
   results[[i]] <- results_tmp
   
}

results_df <- do.call(rbind.data.frame, results)

X<-split(results_df, results_df$iteration)

 invisible(lapply(seq_along(results),
       function(i,x) {assign(paste0("train_set_",i),x[[i]], envir=.GlobalEnv)},
       x=results))

Step 3: Train Models on Each Subset

# Step 3: Train Models on Each Subset:

#training
wd = getwd()
results_1 <- list()

for (i in 1:100){
     
    model_i <- ranger(class ~  height + weight + salary, data = X[[i]], probability = TRUE)
    saveRDS(model_i, paste0("wd", paste("model_", i, ".RDS")))
    results_1[[i]] <- model_i   
}

Step 4: Combine All Models and Use Combined Model to Make Predictions on the Test Set:

# Step 4: Combine All Models and Use Combined Model to Make Predictions on the Test Set:
results_2 <- list()
for (i in 1:100){
predict_i <- data.frame(predict(results_1[[i]], data = test_set)$predictions)


predict_i$id = 1:nrow(predict_i)
 results_2[[i]] <- predict_i
   
}

final_predictions = aggregate(.~ id, do.call(rbind, results_2), mean)

My Question: I would like to see if I can incorporate "parallel computing" into Step 2, Step 3 and Step 4 to potentially make the code I have written run faster. I consulted other posts (e.g.https://stackoverflow.com/questions/14106010/parallel-execution-of-random-forest-in-r, https://stats.stackexchange.com/questions/519640/parallelizing-random-forest-learning-in-r-changes-the-class-of-the-rf-object) and I would like to see if I can reformat the code I have written and incorporate similar "parallel computing" functions for improving my code:

library(parallel)
library(doParallel)
library(foreach)

#Try to parallelize
cl <- makeCluster(detectCores()-1)
registerDoParallel(cl)

# Insert Reformatted Step 2 - Step 4 Here:

stopImplicitCluster()
stopCluster(cl)
rm(cl)

But I am still new to the world of parallel computing and still trying to figure out how to reformat my code so that this will work.

Can someone please show me how to do this?

Note:

stats_noob
  • 5,401
  • 4
  • 27
  • 83
  • 3
    I just started playing with the tidymodels package: https://www.tidymodels.org/ Apart from your parallelization question, that framework would significantly simplify your model construction. A good introduction tutorial to a random forest that I worked through can be found here: https://www.gmudatamining.com/lesson-13-r-tutorial.html#Decision_Trees If you back out to earlier lessons they will show you how to create recipes and models for your workflows. The tidymodels page has more comprehensive tutorials. – SteveM Jun 19 '22 at 02:52
  • @ SteveM : Thank you for your reply and your suggestions - I will check these links out. Thank you so much! – stats_noob Jun 19 '22 at 02:57
  • 1
    @stats_noob Please also check the book "tidy machine learning with R" which is written by Max Kuhn, who is the chief developer of tidymodels. He has used doParallel and doMC packages for parallelization. doMC is for linux system, and doParalled is for windows. You will get an idea how these packages are used. – Eva Jun 19 '22 at 04:03
  • @ Eva: Thank you for your reply! I have this book saved on my computer (https://www.tmwr.org/categorical.html) and have been going over parts of this - it seems very useful! s I understand, libraries like "caret/tidymodels" are able to parallelize a single model (e.g. cross validating a single random forest model). In my case, I am not sure if "tidymodels" will be able to parallelize all these models together in this exact format .... maybe this can be done...but I am not sure how to do that. Any thoughts about this? Thank you so much! – stats_noob Jun 19 '22 at 04:05
  • 3
    1. Note: ranger already runs in parallel itself. You would have to run ranger on 1 core if you want to parallelize the for loop. ranger is a lot faster than randomForest if you have a lot of rows. You might want to look into h2o and run on a cluster. Also read up on upsampling if you have imbalanced classes. Before you go into all of this, run a few basic models to see what the prediction scores are so you have a baseline and don't forget to set the seed for reproducibility. – phiver Jun 19 '22 at 07:16
  • 1
    2. You can also use stratified folds to assure you have the same representation of your class in each fold. Or if you are looking into low class predictions, like fraud, check out (extended) isolation forests or any other anomaly detection and get domain knowledge experts involved to see if certain variables are more important. – phiver Jun 19 '22 at 07:19
  • @ phiver: thank you so much for reply! I am looking into this! – stats_noob Jun 19 '22 at 17:33
  • For parallel computing, do you have access to a cluster computer or a high core count machine? If you do, consider the randomForest parallelization example algorithms described in https://arxiv.org/abs/1709.01195. – George Ostrouchov Feb 24 '23 at 22:43

1 Answers1

3

Noting your openness to a tidymodels approach, you could try this using your original_data and including parallel processing:

library(tidyverse)
library(tidymodels)
library(vip)
library(doParallel)
library(tictoc)
library(themis)

registerDoParallel(cores = 6)

# Supplied data
set.seed(2022)

original_data <- rbind(
  data_1 = data.frame(
    class = 1,
    height = rnorm(10000, 180, 10),
    weight = rnorm(10000, 90, 10),
    salary = rnorm(10000, 50000, 10000)
  ),
  data_2 = data.frame(
    class = 0,
    height = rnorm(100, 160, 10),
    weight = rnorm(100, 100, 10),
    salary = rnorm(100, 40000, 10000)
  )
)

original_data$class <- as.factor(original_data$class)
original_data$id <- 1:nrow(original_data)

tic()

# Train / test data
set.seed(2022)

data_split <- 
  original_data |>
  initial_split(strata = class) # stratify by class

train_df <- data_split |> training()
test_df <- data_split |> testing()

# Create a pre-processing recipe
class_recipe <-
  train_df |>
  recipe() |>
  update_role(class, new_role = "outcome") |>
  update_role(id, new_role = "id") |>
  update_role(-has_role("outcome"), -has_role("id"), new_role = "predictor") |> 
  step_rose(class)

# Check class balance
class_recipe |> prep() |> bake(new_data = NULL) |> count(class)
#> # A tibble: 2 × 2
#>   class     n
#>   <fct> <int>
#> 1 0      7407
#> 2 1      7589

summary(class_recipe)
#> # A tibble: 5 × 4
#>   variable type    role      source  
#>   <chr>    <chr>   <chr>     <chr>   
#> 1 class    nominal outcome   original
#> 2 height   numeric predictor original
#> 3 weight   numeric predictor original
#> 4 salary   numeric predictor original
#> 5 id       numeric id        original

# Create model & workflow
ranger_model <- 
  rand_forest(mtry = tune()) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("classification")

ranger_wflow <- workflow() |>
  add_recipe(class_recipe) |>
  add_model(ranger_model)

# Tune model with 10-fold Cross Validation
set.seed(2022)

folds <- vfold_cv(train_df, v = 10)

set.seed(2022)

ranger_res <- ranger_wflow |> 
  tune_grid(
    resamples = folds,
    grid = crossing(
      mtry = seq(1, 3, 1),
    ),
    control = control_grid(verbose = TRUE),
    metrics = metric_set(accuracy) # choose a metric, e.g. accuracy
  )

# Fit model
best_tune <- ranger_res |> select_best()

set.seed(2022)

ranger_fit <- ranger_wflow |> 
  finalize_workflow(best_tune) %>% 
  fit(train_df)

# Test
class_results <- ranger_fit |> augment(new_data = test_df)

class_results |> accuracy(class, .pred_class)
#> # A tibble: 1 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.912

# Visualise feature importance
ranger_fit |>
  extract_fit_parsnip() |> 
  vip() +
  labs(title = "Feature Importance -- Ranger")

toc()
#> 62.393 sec elapsed

Created on 2022-06-21 by the reprex package (v2.0.1)

Carl
  • 4,232
  • 2
  • 12
  • 24
  • @ Carl: thank you so much for your answer! I am still new to tidymodels - does the code you have provided able to address the class imbalance problem? For example, the training set in my example is sampled in a way to address the class balance. I am trying to make sure that each decision tree in the random forest is trained on a balanced dataset. Is this possible? Thanks! – stats_noob Jun 21 '22 at 15:18
  • Yes. `recipes` in tidymodels has [many options for up/down sampling etc.](https://recipes.tidymodels.org/reference/step_downsample.html). `step_downsample` is just one example. Example [article here](https://www.tidymodels.org/learn/models/sub-sampling/). – Carl Jun 21 '22 at 15:24
  • Essentially adds 1 line to the code creating the `class_recipe`. – Carl Jun 21 '22 at 15:31
  • @ Carl: thank you so much for your reply! Just a question - do you know if it is possible to use parallel processing with the code I have already written? – stats_noob Jun 21 '22 at 15:40
  • I've added a `step_rose` to the recipe and also a line below that to show the corrected imbalance. – Carl Jun 21 '22 at 15:46
  • @ Carl: thank you for you this! I am also hoping that someone might show how to do this without tidy/caret as well! – stats_noob Jun 26 '22 at 17:22