
I am trying to create a function that generates multiple random forest models based on the values in a column. Suppose we have:

df <- data.frame(Name = c('Aaron','Bob','Nik','Peter','George'),
                 Work = c('A','B','B','C','A'),
                 Age  = c(45,28,64,27,54),
                 cl   = c(1,2,2,3,1))

Name Work Age cl
Aaron  A  45  1
Bob    B  28  2
Nik    B  64  2
Peter  C  27  3
George A  54  1

I have to subset the data by cl and then build one model per cl value. In the example above there are 3 cl values, so I would first divide the data into three subsets and then build three different models.

Name Work Age cl              Name Work Age cl            Name Work Age cl  
Aaron  A  45  1               Bob    B  28  2             Peter  C  27  3
George A  54  1               Nik    B  64  2

I used the loop below to do the split (here on a data frame uk, with v10v11 playing the role of cl):

for(i in unique(uk$v10v11)) {
  nam <- paste("df", i, sep = ".")
  assign(nam, uk[uk$v10v11==i,])
}

I want to write a complete function to which I can supply my df and which builds one model per value of cl. I also want to be able to tune the random forest parameters for each model from the function itself. Please help.

  • You can use `split`. Also look [here](https://stackoverflow.com/questions/18913447/splitting-a-data-frame-by-a-variable) – akrun Jul 17 '17 at 15:41
  • After `split`, use the resulting `list` of data frames and a `for` loop to train the models and tune the parameters – BENY Jul 17 '17 at 15:49
  • But there could be any number of distinct values in cl; I am not sure how many will appear. – Jul 17 '17 at 16:00
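
A minimal sketch of the `split` suggestion from the comments, using the toy data from the question. `split` creates one list element per distinct value of cl, so the number of values does not need to be known in advance. The name `subsets` is only illustrative.

```r
# Toy data from the question
df <- data.frame(Name = c("Aaron", "Bob", "Nik", "Peter", "George"),
                 Work = c("A", "B", "B", "C", "A"),
                 Age  = c(45, 28, 64, 27, 54),
                 cl   = c(1, 2, 2, 3, 1))

# One data frame per distinct value of cl, however many there are
subsets <- split(df, df$cl)

names(subsets)        # "1" "2" "3"
nrow(subsets[["2"]])  # 2 (Bob and Nik)
```

The list can then be passed to `lapply` or a `for` loop, which avoids creating separate variables with `assign`.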

1 Answer


I would recommend watching this video from Hadley Wickham when you have the time. It relates very much to your challenge.

This also looks like a classic split-apply-combine problem, so my first thought is to consider the tidyverse. Here is some code that might help you:

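library(tidyverse)
library(randomForest)

# One model per value of cl. cl is constant within a group, so it cannot be
# the response; Age ~ Work is only a placeholder for your real formula.
df2 <- df %>%
  mutate(Work = factor(Work)) %>%
  group_by(cl) %>%
  nest() %>%
  mutate(rfcol = map(data, ~ randomForest(Age ~ Work, data = .x)))

Basically, df2 has one row per value of cl, with a list-column data holding that group's rows and a new list-column rfcol holding the random forest fitted to them. You can explore the details of each model by looking at df2$rfcol[[2]]. The five-row example is too small for meaningful models, so try this on your real data.

To summarize what's going on: group_by followed by nest creates one data frame per cl value, and map calls randomForest once per data frame, with .x referring to the current group's data. Note that a bare . inside mutate refers to the whole data frame coming through the pipe, not to each group.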
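
For the "one function, separate parameters per model" part of the question, here is a rough base-R sketch. The helper name `fit_rf_by_group` and its `params` argument are hypothetical, not part of any package, and the built-in `iris` data stands in for the real data because the five-row example has too few rows per group. `params` is assumed to be a named list keyed by group value, where each entry holds extra `randomForest()` arguments for that group.

```r
library(randomForest)

# Hypothetical helper: fits one random forest per value of `group_col`.
# `params` is a named list keyed by group value (as a character string);
# each entry is a list of extra randomForest() arguments for that group.
fit_rf_by_group <- function(data, group_col, formula, params = list()) {
  subsets <- split(data, data[[group_col]])
  models <- lapply(names(subsets), function(g) {
    # Groups without an entry in `params` fall back to the defaults
    args <- c(list(formula, data = subsets[[g]]), params[[g]])
    do.call(randomForest, args)
  })
  setNames(models, names(subsets))
}

# Stand-in data: one regression forest per species, with different settings
models <- fit_rf_by_group(
  iris, "Species",
  Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
  params = list(setosa     = list(ntree = 200, mtry = 2),
                versicolor = list(ntree = 300, nodesize = 10))
)

models$setosa$ntree     # 200
models$virginica$ntree  # 500, the randomForest default
```

With the question's data the call would use "cl" as the grouping column and keys such as "1", "2", "3" in `params`. The parameter values themselves still have to come from somewhere; one option is to run the package's `tuneRF()` on each subset to pick `mtry`, which is not shown here.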

Hope this helps. As noted, try watching that video from Hadley Wickham if you have the time; it explains in detail how to think about these types of problems.

simitpatel
  • I have seen the video you mentioned. It is really helpful. But my major issue is tuning the parameters for each model separately from within the function itself. –  Jul 18 '17 at 02:35