I am trying to analyze marathon data. I built a simple model and created a decision tree:

library(rpart)  # provides rpart()
fit <- rpart(timeCategory ~ country + age.group + participated.times, data=data)

My goal is to create a universal formula for predicting the results, like the one in this article (page 4). [Image: the example formula]

How can I do this in R, and which techniques should I use? As a result, I would like to have the formula, with the provided attributes, as a string.

Data: some of the real data that I am using can be downloaded here. Read the data as follows:

library(dplyr)  # provides ntile()
data <- read.table("data/processedData.txt", header=TRUE)
data$timeCategory <- ntile(data$time, 10)  # split time into 10 equal-sized bins
Bob
  • Look at `?predict.rpart`. It looks like the syntax is `predict(fit,newdata)`. If you want to type out the function as in your image, you'll have to extract the coefficients from `fit` and do some string manipulation. – Frank May 20 '15 at 18:32
  • @Frank can you provide more details or some dummy example? I did not understand. – Bob May 20 '15 at 18:42
  • @Bob: It's your responsibility to provide the example. – IRTFM May 20 '15 at 18:48
  • @BondedDust Added real data. – Bob May 20 '15 at 18:52
  • I just meant that if you want to actually use the prediction function, `?predict.rpart` is the part of the documentation you should look at; and if you want to write the formula, it is an onerous task that begins with examination of the `fit` object, probably with `str(fit)`, looking for where it stores estimated coefficients and regressor names. – Frank May 20 '15 at 18:58
  • Only two out of the four column names in the formula are in that dataset. – IRTFM May 20 '15 at 18:59
  • If you want to provide data, it is generally best to do so in R format and it is also important to provide your desired output. I'd suggest reading BrodieG's answer here: http://stackoverflow.com/a/28481250 – Frank May 20 '15 at 19:00
  • @Frank The desired output is explained in the question: a formula. Added a reading script. – Bob May 20 '15 at 19:07
  • @BondedDust About the columns that are in the dataset: the image in the question is just an example. – Bob May 20 '15 at 19:08
  • Not talking about image. Talking about `timeCategory` and `particip.time`. – IRTFM May 20 '15 at 19:13
  • It's really not. You want the prediction formula written as a string over multiple lines with coefficients rounded to four digits; and then converted to a bitmap image or written to a text file? If so, I'll pass and others might as well, but you could begin to look into it yourself by looking at `str(fit)`. I guess you haven't even run the line of code you've provided, to create `fit`... – Frank May 20 '15 at 19:14
  • @Frank You are not right about running the code; of course I ran it. About the output: you are right. I will supplement the question. – Bob May 20 '15 at 19:17
  • @BondedDust Yes, I forgot about timeCategory. Supplemented the question. `particip.time` is present in the dataset. – Bob May 20 '15 at 19:19
  • @Frank Besides `str(fit)` I also tried commands such as `printcp(fit)`, `print(fit)` and `summary(fit)`, but could not figure out where to get those coefficients from. – Bob May 20 '15 at 19:21
  • The formula offered as the desired output does not match the type of result that rpart produces for a categorical outcome. Answering based on the type of output requested rather than on your suggested code: – IRTFM May 20 '15 at 19:54
  • Note that the paper you refer to in your question does not use the CART algorithm (as implemented in `rpart`) but the M5' algorithm implemented in Weka's `M5P` function. It is easily available in R through the `RWeka` package. – Achim Zeileis May 20 '15 at 22:21
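
To illustrate the `predict(fit, newdata)` syntax mentioned in the comments, here is a minimal sketch. The data frame `toy` and all of its values are made up for illustration (only the column names mirror the question), and the response is the continuous `time` rather than `timeCategory`. Note that an `rpart` regression tree predicts the mean of the leaf an observation falls into, so there are no coefficients to extract from `fit`.

```r
library(rpart)

set.seed(1)
n <- 200
# Hypothetical toy data: column names mirror the question, values are invented
toy <- data.frame(
  country       = factor(sample(c("Rapla", "Tallinn", "Tartu"), n, replace = TRUE)),
  age.group     = factor(sample(c("M20", "M35", "M50"), n, replace = TRUE)),
  particip.time = sample(1:20, n, replace = TRUE)
)
toy$time <- 9500 - 200 * (toy$country == "Rapla") -
  3 * toy$particip.time + rnorm(n, sd = 300)

# Numeric response, so rpart fits a regression tree
fit <- rpart(time ~ country + age.group + particip.time, data = toy)

# One new observation; factor levels must match those used for fitting
newdata <- data.frame(
  country       = factor("Rapla", levels = levels(toy$country)),
  age.group     = factor("M35", levels = levels(toy$age.group)),
  particip.time = 5
)

# Returns the mean time of the leaf this observation lands in
pred <- predict(fit, newdata)
print(pred)
```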

1 Answer

These are the regression coefficients using time as a continuous value, which is the type of prediction being offered in the example. They can be used to build the type of formula you are requesting.

> lmfit <- lm(time ~ country + age.group + particip.time, data=data)
> lmfit

Call:
lm(formula = time ~ country + age.group + particip.time, data = data)

Coefficients:
      (Intercept)      countryJõgeva  countryLääne-Viru        countryLäti  
         9526.702            345.930            122.513            -73.239  
     countryLeedu       countryPärnu       countryRapla    countrySaaremaa  
          120.592            -78.086           -208.882            114.292  
   countryTallinn       countryTartu    countryViljandi       age.groupM20  
          -37.536             55.771            -70.417           -142.600  
     age.groupM21       age.groupM35       age.groupM40       age.groupM45  
         -218.225           -218.067            -20.108           -196.331  
     age.groupM50      particip.time  
           88.342             -2.487  

If you want them all lined up then:

> as.matrix(coef(lmfit))
                         [,1]
(Intercept)       9526.702146
countryJõgeva      345.930334
countryLääne-Viru  122.513294
countryLäti        -73.239333
countryLeedu       120.591585
countryPärnu       -78.086107
countryRapla      -208.882244
countrySaaremaa    114.291592
countryTallinn     -37.535659
countryTartu        55.771326
countryViljandi    -70.416659
age.groupM20      -142.599598
age.groupM21      -218.224754
age.groupM35      -218.066655
age.groupM40       -20.108242
age.groupM45      -196.331263
age.groupM50        88.341978
particip.time       -2.486818
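
As a rough sketch of how such coefficients turn into a prediction: the intercept covers the reference level of each factor (the level that has no row of its own), and each remaining coefficient is added when its level applies. The example below uses invented toy data rather than the marathon file, so the numbers are not comparable to those above; `toy` and `runner` are hypothetical names.

```r
set.seed(42)
n <- 300
# Hypothetical toy data: values are invented, only the column names follow the answer
toy <- data.frame(
  country       = factor(sample(c("Harju", "Rapla", "Tartu"), n, replace = TRUE)),
  age.group     = factor(sample(c("M17", "M20", "M35"), n, replace = TRUE)),
  particip.time = sample(1:15, n, replace = TRUE)
)
toy$time <- 9500 + 50 * (toy$country == "Tartu") - 200 * (toy$country == "Rapla") -
  140 * (toy$age.group == "M20") - 2.5 * toy$particip.time + rnorm(n, sd = 400)

lmfit <- lm(time ~ country + age.group + particip.time, data = toy)
b <- coef(lmfit)

# Manual prediction for country "Tartu", age group "M20", particip.time = 4.
# "Harju" and "M17" are the reference levels here, so they add nothing.
manual <- b[["(Intercept)"]] + b[["countryTartu"]] + b[["age.groupM20"]] +
  b[["particip.time"]] * 4

# The same prediction via predict()
runner <- data.frame(
  country       = factor("Tartu", levels = levels(toy$country)),
  age.group     = factor("M20", levels = levels(toy$age.group)),
  particip.time = 4
)
auto <- unname(predict(lmfit, newdata = runner))

print(all.equal(manual, auto))
```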

Further processing to text:

> form <- as.matrix(coef(lmfit))
> # anchor the patterns so that only the variable-name prefix is matched
> rownames(form) <- sub("^country", "country == ", rownames(form))
> rownames(form) <- sub("^age\\.group", "age.group == ", rownames(form))
> form
                             [,1]
(Intercept)           9526.702146
country == Jõgeva      345.930334
country == Lääne-Viru  122.513294
country == Läti        -73.239333
country == Leedu       120.591585
country == Pärnu       -78.086107
country == Rapla      -208.882244
country == Saaremaa    114.291592
country == Tallinn     -37.535659
country == Tartu        55.771326
country == Viljandi    -70.416659
age.group == M20      -142.599598
age.group == M21      -218.224754
age.group == M35      -218.066655
age.group == M40       -20.108242
age.group == M45      -196.331263
age.group == M50        88.341978
particip.time           -2.486818

Almost complete (the intercept is still wrapped in double parentheses and the factor levels are unquoted):

cat(paste( form, paste0("(", rownames(form), ")" ), sep="*", collapse="+\n") )

9526.70214596473*((Intercept))+
345.93033373724*(country == Jõgeva)+
122.51329418344*(country == Lääne-Viru)+
-73.2393326763322*(country == Läti)+
120.591584530399*(country == Leedu)+
-78.0861070429056*(country == Pärnu)+
-208.882244416016*(country == Rapla)+
114.291592299937*(country == Saaremaa)+
-37.5356589458207*(country == Tallinn)+
55.771326363022*(country == Tartu)+
-70.4166587941724*(country == Viljandi)+
-142.599598141679*(age.group == M20)+
-218.224754448193*(age.group == M21)+
-218.066655292225*(age.group == M35)+
-20.1082422022072*(age.group == M40)+
-196.33126335145*(age.group == M45)+
88.3419781798024*(age.group == M50)+
-2.48681789339678*(particip.time)
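
One possible way to finish the job, sketched on invented toy data rather than the marathon file: quote the factor levels, replace the intercept term with `1`, and round the coefficients to four decimals as discussed in the comments. The result is then valid R, so it can be checked against `predict()` with `eval(parse(...))`. The names `toy`, `runner`, `terms_txt` and `formula_string` are hypothetical.

```r
set.seed(42)
n <- 300
# Hypothetical toy data: values are invented, only the column names follow the answer
toy <- data.frame(
  country       = factor(sample(c("Harju", "Rapla", "Tartu"), n, replace = TRUE)),
  age.group     = factor(sample(c("M17", "M20", "M35"), n, replace = TRUE)),
  particip.time = sample(1:15, n, replace = TRUE)
)
toy$time <- 9500 - 200 * (toy$country == "Rapla") - 140 * (toy$age.group == "M20") -
  2.5 * toy$particip.time + rnorm(n, sd = 400)

lmfit <- lm(time ~ country + age.group + particip.time, data = toy)
b <- coef(lmfit)

# Turn coefficient names such as "countryTartu" into (country == "Tartu"),
# using the factor levels stored in the fit (exact matching, no regex)
terms_txt <- names(b)
for (v in names(lmfit$xlevels)) {
  for (lev in lmfit$xlevels[[v]]) {
    terms_txt[terms_txt == paste0(v, lev)] <- sprintf('(%s == "%s")', v, lev)
  }
}
terms_txt[terms_txt == "(Intercept)"] <- "1"

# Coefficients rounded to four decimals, one term per line
formula_string <- paste(sprintf("%.4f * %s", b, terms_txt), collapse = " +\n")
cat(formula_string, "\n")

# Sanity check: evaluate the string for one runner and compare with predict()
runner <- data.frame(
  country       = factor("Tartu", levels = levels(toy$country)),
  age.group     = factor("M20", levels = levels(toy$age.group)),
  particip.time = 4
)
from_string  <- eval(parse(text = formula_string), envir = runner)
from_predict <- unname(predict(lmfit, newdata = runner))
print(abs(from_string - from_predict) < 0.01)
```

The small tolerance in the last line is there because the rounded coefficients will not reproduce `predict()` exactly.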
IRTFM