I'm creating a decision tree with the R rpart package based on x number of variables and a dataframe:

fit<-rpart(y~x1+x2+x3+x4,data=(mydataframe),
  control=rpart.control(minsplit = 20, minbucket = 0, cp=.01))

But instead of using the entire dataframe, I want to fit a separate tree for each of four or five subsets of the data, defined by the levels of a factor, say x4. How can I run decision trees on all of these subsets at once instead of subsetting the data again and again?

Based on a search of SO, it looks like either by or ddply might be the right choice. Here's what I've tried for ddply:

fit<-ddply(mydataframe, dataframe$x4, function (df)  
    rpart(y~x1+x2+x3+x4,data=(df), 
    control=rpart.control(minsplit = 20, minbucket = 0, cp=.01)))

but what I'm getting back is:

Error in eval(expr, envir, enclos) : object 'x4value' not found

where x4value is one of the variable values I'd like to split out by. So I have a column of values:

x4
BucketName1
BucketName2
BucketName3
BucketName4

str(mydataframe) shows that $x4 is a : Factor w/ 8 levels and no symbols.

Additionally, I ran mydataframe = na.omit(dataframe) at the very beginning to avoid nulls.

Possible issues I've already ruled out:

The rpart bit runs fine when I run it manually as such:

mydataframe<-subset(trainData, x4=="BucketName1")

fit<-rpart(y~x1+x2+x3+x4,data=(mydataframe), 
    control=rpart.control(minsplit = 20, minbucket = 0, cp=.01))

but borks whenever I try to loop through all subsets using ddply.

Complete reproducible sample code:

mydataframe<-data.frame  ( x1=sample(1:10),
                           x2=sample(1:10),
                           x3=sample(1:10),
                           x4= sample(letters[1:4], 20, replace = TRUE))
str(mydataframe)

fit<-ddply(mydataframe, mydataframe$x4, function (df)
    rpart(y~x1+x2+x3+x4,data=(df), control=rpart.control(minsplit = 20,      minbucket = 0, cp=.01)))

Output:

> str(mydataframe)
'data.frame':  20 obs. of  4 variables:
 $ x1: int  1 6 8 4 7 9 3 2 10 5 ...
 $ x2: int  9 4 5 8 6 3 7 10 2 1 ...
 $ x3: int  2 6 5 3 1 4 9 7 10 8 ...
 $ x4: Factor w/ 4 levels "a","b","c","d": 4 4 3 2 3 4 3 3 1 3 ...
> fit<-ddply(mydataframe, mydataframe$x4, function (df) rpart(y~x1+x2+x3+x4,data=(df), control=rpart.control(minsplit = 20, minbucket = 0, cp=.01)))
Error in eval(expr, envir, enclos) : object 'd' not found
vko
  • Please take the time to create a minimal, [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data. It seems odd that you get an error about "x4value" when that doesn't appear anywhere in the code you've shared. It seems like you're leaving something important out. – MrFlick May 04 '15 at 19:37
  • Thanks for the hint, I've added some sample code. – vko May 04 '15 at 19:51

3 Answers

You want to do two things with your code:

  1. Use dlply instead of ddply, since you want a list of rpart objects rather than a data frame; ddply tries to combine the results into a data frame, which rpart objects are not. ddply would be useful if you wanted to return predicted values for the original data, since those can be formatted into a data frame.

  2. Use .(x4) instead of dataframe$x4 in the dlply. Using the latter will produce unpredictable results.

Additionally, in your example, you should specify a y value and remove the .... from after x4
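Both points can be sketched together as follows. This is a minimal illustration, not the asker's actual data: the `y` column and the 200-row sample are invented here, `x4` is dropped from the formula because it is constant within each subset, and `minbucket` is left at its default for simplicity.

```r
library(plyr)
library(rpart)

set.seed(1)
n <- 200
# Hypothetical data: the question's sample has no y, so one is invented here
mydataframe <- data.frame(
  x1 = sample(1:10, n, replace = TRUE),
  x2 = sample(1:10, n, replace = TRUE),
  x3 = sample(1:10, n, replace = TRUE),
  x4 = factor(sample(letters[1:4], n, replace = TRUE))
)
mydataframe$y <- 2 * mydataframe$x1 + rnorm(n)

# dlply returns a list with one element per level of x4.
# x4 is left out of the formula because it is constant within each subset.
fits <- dlply(mydataframe, .(x4), function(df)
  rpart(y ~ x1 + x2 + x3, data = df,
        control = rpart.control(minsplit = 20, cp = .01)))

names(fits)          # one entry per level of x4
class(fits[["a"]])   # each element is an ordinary rpart object
```

Each element of `fits` can then be passed to the usual rpart functions on its own.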

Max Candocia
  • Thank you, this worked perfectly! I should have specified initially, but I also want to output the results with printcp(fit), but I'm getting `Error in printcp(fit) : 'x' must be an "rpart" object` for `printcp(fit)` and `plotcp(fit)`. Any hints to troubleshoot this part? – vko May 04 '15 at 20:35
  • Your result is in a list. You can do print(fit[[1]]) if you want to get the first result. You can also make an automated `l_ply()` function to do it for you, but your function would have to save results, since the plots would overwrite each other. You could also do something like `par(mfrow = c(2,2))` to get multiple plots per image. – Max Candocia May 04 '15 at 20:37
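The suggestion in the comment above might look roughly like this. It is only a sketch: the data and the `y` column are invented, and the list is built with `split()` and `lapply()` as a stand-in for whatever `dlply()` returned. A plain `for` loop is used here; `l_ply()` from plyr would be an alternative.

```r
library(rpart)

set.seed(1)
n <- 200
# Hypothetical data with an invented response y
d <- data.frame(x1 = sample(1:10, n, replace = TRUE),
                x4 = factor(sample(letters[1:4], n, replace = TRUE)))
d$y <- 2 * d$x1 + rnorm(n)

# Stand-in for the list returned by dlply(): one rpart fit per level of x4
fit <- lapply(split(d, d$x4), function(df) rpart(y ~ x1, data = df))

# Print the complexity table of every model, labelled by group
for (level in names(fit)) {
  cat("\n--- x4 =", level, "---\n")
  printcp(fit[[level]])
}

# Draw the four cross-validation plots on one 2 x 2 page
old_par <- par(mfrow = c(2, 2))
for (level in names(fit)) {
  plotcp(fit[[level]])
  title(sub = paste("x4 =", level))
}
par(old_par)
```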

You are passing an incorrect value to the .variables= parameter of ddply(). You are supposed to pass either a quoted variable name, a formula, or a character vector of variable names. Since you are passing mydataframe$x4, it is coerced to a character vector, and ddply looks for each value in that column as if it were a variable name, hence the error about object 'd' not found.

Here's what the call should look like

fit<-ddply(mydataframe, ~x4, function (df)
    rpart(y~x1+x2+x3+x4,data=(df), control=rpart.control(minsplit = 20, minbucket = 0, cp=.01)))

or

fit<-ddply(mydataframe, .(x4), function (df)
    rpart(y~x1+x2+x3+x4,data=(df), control=rpart.control(minsplit = 20, minbucket = 0, cp=.01)))

or

fit<-ddply(mydataframe, "x4", function (df)
    rpart(y~x1+x2+x3+x4,data=(df), control=rpart.control(minsplit = 20,  minbucket = 0, cp=.01)))
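As a rough check that the three spellings are interchangeable, the sketch below runs each of them on invented data. Two things here are assumptions rather than part of the answer: the `y` column is made up, and `dlply` is used in place of `ddply` because an rpart object cannot be combined into a data frame. `fit_tree` is a hypothetical helper name.

```r
library(plyr)
library(rpart)

set.seed(1)
n <- 200
# Hypothetical data with an invented response y
mydataframe <- data.frame(
  x1 = sample(1:10, n, replace = TRUE),
  x2 = sample(1:10, n, replace = TRUE),
  x3 = sample(1:10, n, replace = TRUE),
  x4 = factor(sample(letters[1:4], n, replace = TRUE))
)
mydataframe$y <- 2 * mydataframe$x1 + rnorm(n)

# Hypothetical helper: fit one tree on one subset
fit_tree <- function(df) rpart(y ~ x1 + x2 + x3, data = df)

by_formula <- dlply(mydataframe, ~x4, fit_tree)     # formula
by_quoted  <- dlply(mydataframe, .(x4), fit_tree)   # quoted variable
by_string  <- dlply(mydataframe, "x4", fit_tree)    # character vector

# All three spellings name the same grouping variable
identical(names(by_formula), names(by_quoted))
identical(names(by_quoted), names(by_string))
```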
MrFlick

If you're not comfortable with plyr, you can also do this with base R functions.

splitData = split(mydataframe, mydataframe$x4)

getModel = function(df) {
    fit <- rpart(y~x1+x2+x3+x4+xN....,data=df, 
        control=rpart.control(minsplit = 20, minbucket = 0, cp=.01))
    return(fit)
}

models = lapply(splitData, getModel)
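A self-contained version of the base R approach might look like the following. The data and the `y` column are invented for illustration, the elided predictors are simply omitted, and `minbucket` is left at its default.

```r
library(rpart)

set.seed(1)
n <- 200
# Hypothetical data with an invented response y
mydataframe <- data.frame(
  x1 = sample(1:10, n, replace = TRUE),
  x2 = sample(1:10, n, replace = TRUE),
  x3 = sample(1:10, n, replace = TRUE),
  x4 = factor(sample(letters[1:4], n, replace = TRUE))
)
mydataframe$y <- 2 * mydataframe$x1 + rnorm(n)

# One data frame per level of x4
splitData <- split(mydataframe, mydataframe$x4)

# Fit one tree on one subset
getModel <- function(df) {
  rpart(y ~ x1 + x2 + x3, data = df,
        control = rpart.control(minsplit = 20, cp = .01))
}

# Named list of rpart objects, one per level of x4
models <- lapply(splitData, getModel)

names(models)
# Predictions from the tree fitted on the "a" subset
predict(models[["a"]], newdata = head(splitData[["a"]]))
```

Because `split()` names its pieces after the factor levels, the resulting list can be indexed by level name.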

You can also do this with dplyr instead of plyr.

mydataframe %>% group_by(x4) %>%
   do(model = getModel(.))
Josh W.