
When running a decision tree I use:

mod1 <- C5.0(Species ~ ., data = iris)

If I want to pass in a data frame and set the target feature name in the formula (something different than "Species") how would I do this?

For example,

mod1 <- C5.0(other_data[,target_column] ~ ., data = other_data)

which obviously doesn't work.

Cybernetic

3 Answers


1) Paste together the formula:

fun <- function(resp, data) C5.0(as.formula(paste(resp, "~ .")), data = data)

# test
library(C50)
fun("Species", iris)

giving:

Call:
C5.0.formula(formula = as.formula(paste(resp, "~ .")), data = data)

Classification Tree
Number of samples: 150 
Number of predictors: 4 

Tree size: 4 

Non-standard options: attempt to group attributes

2) Or use this variation, which gives a nicer rendition of the call on the line after Call: in the output:

fun <- function(resp, data) 
  do.call(C5.0, list(as.formula(paste(resp, "~ .")), data = substitute(data)))
fun("Species", iris)

giving:

Call:
C5.0.formula(formula = Species ~ ., data = iris)

Classification Tree
Number of samples: 150 
Number of predictors: 4 

Tree size: 4 

Here is a second test of this version of fun, using the built-in data frame CO2:

fun("Plant", CO2)

giving:

Call:
C5.0.formula(formula = Plant ~ ., data = CO2)

Classification Tree
Number of samples: 84 
Number of predictors: 4 

Tree size: 7 

Non-standard options: attempt to group attributes
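
Base R also provides reformulate(), which builds the same kind of formula from strings without pasting. The following is only a sketch of that alternative: make_formula and fun2 are hypothetical names, and the C5.0 call is guarded so the snippet still runs where the C50 package is not installed.

```r
# Build `<resp> ~ .` from a column name held in a string.
make_formula <- function(resp) reformulate(".", response = resp)

f <- make_formula("Species")
print(f)

# Hypothetical wrapper around C5.0 that takes the response name as a string.
fun2 <- function(resp, data) {
  C50::C5.0(make_formula(resp), data = data)
}

# Only fit the model if the C50 package is available.
if (requireNamespace("C50", quietly = TRUE)) {
  print(fun2("Species", iris))
}
```

As with the first version of fun, the Call: line of the printed model will show the expression used inside the wrapper rather than the expanded formula.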
G. Grothendieck
  • it says object "resp" not found – Cybernetic May 11 '16 at 23:03
  • The output is shown above and it does not give that error for me. I am using C50 version ‘0.1.0.24’ and "R version 3.3.0 Patched (2016-05-03 r70575)" on Windows. – G. Grothendieck May 11 '16 at 23:04
  • I see you are using "Species" explicitly, and also the iris dataset. I cannot get this to work with an arbitrary target feature name or an arbitrary dataset, as shown in my question – Cybernetic May 11 '16 at 23:22
  • The function is completely general. It does not know anything about the iris data set. I have added another test at the end that uses a different data set and it works too. You will need to provide a complete minimal self contained example in your question that shows that this answer does not work if you would like more help since it consistently works for me. – G. Grothendieck May 11 '16 at 23:55

An alternative that may be preferable is to overwrite the symbol within the parse tree after creating the formula:

x <- Species~.;
x;
## Species ~ .
x[[2L]] <- as.symbol('Blah');
x;
## Blah ~ .

The above works because formulas are encoded as normal parse trees, with a top-level node that consists of a call (typeof 'language', mode 'call') of the `~`() function, and classed as 'formula':

(function(x) c(typeof(x),mode(x),class(x)))(.~.);
## [1] "language" "call"     "formula"

All parse trees can be read and written as a recursive list structure. Here I'll demonstrate that using a small recursive function I originally wrote for another answer:

unwrap <- function(x) if (typeof(x) == 'language') lapply(as.list(x),unwrap) else x;
unwrap(Species~.);
## [[1]]
## `~`
##
## [[2]]
## Species
##
## [[3]]
## .
##

In other words, parse trees represent function calls with the function symbol as the first list component, and then all function arguments as the subsequent list components. The special case of a normal formula captures the LHS as the first function argument and the RHS as the second. Hence x[[2L]] represents the LHS symbol of your formula, which we can overwrite directly with a normal assignment to your preferred symbol.
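
To tie this back to the question, where the target name arrives as a string, the same assignment can be wrapped in a small helper. This is a minimal sketch: set_response and target are hypothetical names, and `y ~ .` is just a template whose left-hand side is a placeholder.

```r
# Replace the LHS of a two-sided formula with the symbol named by `target`.
set_response <- function(f, target) {
  f[[2L]] <- as.symbol(target)  # element 2 is the LHS node of the `~` call
  f
}

target <- "Plant"                # hypothetical column name held in a variable
f <- set_response(y ~ ., target)
print(f)
```

The result keeps its 'formula' class, so it can be passed wherever a literal formula would be accepted, for example as the first argument of C5.0 together with data = CO2.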

bgoldst

The following allows for passing in arbitrary data and a target feature to the C5.0 function:

boosted_trees <- function(data_train, target_feature, iter_choice) {

    # Locate the target column by exact name (grep() would also match partial names)
    target_index <- match(target_feature, colnames(data_train))
    stopifnot(!is.na(target_index))
    predictors <- data_train[, -target_index, drop = FALSE]
    model_boosted <- C5.0(x = predictors, y = data_train[[target_feature]], trials = iter_choice)
    # Store the actual data in the call so it can be re-evaluated outside this function
    model_boosted$call$x <- predictors
    model_boosted$call$y <- data_train[[target_feature]]
    return(model_boosted)

}

The trick is to overwrite the x and y entries of the stored call with the actual data after building the model, so that the model can be plotted.
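
A note on locating the target column: grep() treats its first argument as a regular expression and returns every column whose name contains a match, whereas match() compares whole names exactly. The short sketch below illustrates the difference on the built-in iris data; the search strings are chosen only for illustration.

```r
cols <- colnames(iris)

# Pattern matching: every column whose name contains "Length" is returned.
partial_hits <- grep("Length", cols)
print(partial_hits)

# Exact matching: a single index for the full column name, or NA if absent.
exact_hit <- match("Sepal.Length", cols)
print(exact_hit)

missing_hit <- match("Length", cols)
print(missing_hit)
```

With more than one hit, a negative index built from the grep() result would drop several predictor columns at once, so exact matching is the safer choice when the target is given by name.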

Cybernetic