2

Is it possible to get a p-value for nodes in a categorical tree analysis with R? I am using rpart and can't locate a p-value for each node. Maybe this is only possible with a regression and not categories.

structure(list(subj = c(702L, 702L, 702L, 702L, 702L, 702L, 702L, 
702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 702L, 
702L, 702L, 702L, 702L, 702L, 702L), visit = c(4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L), run = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 
4L), .Label = c("A", "B", "C", "D", "E", "xdur", "xend60", "xpre"
), class = "factor"), ho = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), hph = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), longexer = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("10min", "60min"), class = "factor"), 
    esq_sick = c(NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, 
    NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA), esq_sick2 = c(NA, 
    NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, 
    NA, NA, 0L, NA, NA, NA, NA, NA), ll_sick = c(NA, NA, 0L, 
    NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 
    0L, NA, NA, NA, NA, NA), ll_sick2 = c(NA, NA, 0L, NA, NA, 
    NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, 
    NA, NA, NA, NA), esq_01 = c(NA, NA, 2L, NA, NA, NA, NA, NA, 
    NA, NA, 2L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 
    NA), esq_02 = c(NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, 2L, 
    NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA), esq_03 = c(NA, 
    NA, 0L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, 
    NA, NA, 0L, NA, NA, NA, NA, NA), esq_04 = c(NA, NA, 0L, NA, 
    NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, 
    NA, NA, NA, NA, NA), esq_05 = c(NA, NA, 0L, NA, NA, NA, NA, 
    NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, 
    NA, NA), esq_06 = c(NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, 
    1L, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA), 
    esq_07 = c(NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, 
    NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA), esq_08 = c(NA, 
    NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, 
    NA, NA, 0L, NA, NA, NA, NA, NA), esq_09 = c(NA, NA, 0L, NA, 
    NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, 
    NA, NA, NA, NA, NA), esq_10 = c(NA, NA, 0L, NA, NA, NA, NA, 
    NA, NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, 
    NA, NA)), .Names = c("subj", "visit", "run", "ho", "hph", 
"longexer", "esq_sick", "esq_sick2", "ll_sick", "ll_sick2", "esq_01", 
"esq_02", "esq_03", "esq_04", "esq_05", "esq_06", "esq_07", "esq_08", 
"esq_09", "esq_10"), row.names = 7:30, class = "data.frame")



alldata = read.table('symptomology CSV2.csv',header=TRUE,sep=",")

library(rpart)

fit <- rpart(esq_sick2~esq_01_bin + esq_02_bin + esq_03_bin + esq_04_bin + esq_05_bin + esq_06_bin + esq_07_bin + esq_08_bin + esq_09_bin + esq_10_bin + esq_11_bin + esq_12_bin + esq_13_bin + esq_14_bin + esq_15_bin + esq_16_bin + esq_17_bin + esq_18_bin + esq_19_bin + esq_20_bin, method="class", data=alldata)

plot(fit, uniform = FALSE, branch = 1, compress = FALSE, nspace, margin = 0.1, minbranch = 0.3)
text(fit, use.n=TRUE, all=TRUE, cex=.8)
dana
  • 21
  • 4
  • You could improve this question and possibly forestall further close votes by adding a simple example and specifying a particular "node" that was of interest. – IRTFM Jan 04 '15 at 19:29
  • I'm not sure what you mean. You want a picture of the tree? I am interested in all the nodes – dana Jan 04 '15 at 20:10
  • An example would be data and code to produce the object of interest. I do NOT mean a "picture". If you just want a tutorial on inference with tree structured discrimination then you should delete this question and repost this at CrossValidated.com. SO is all about code and data. – IRTFM Jan 04 '15 at 20:21
  • have to try to edit OP cause wont let me put multiple lines here – dana Jan 04 '15 at 20:36
  • Take a look at the [vignette](http://cran.r-project.org/web/packages/partykit/vignettes/constparty.pdf) for the `party` package and search for "p-value". That might provide what you're looking for. – eipi10 Jan 04 '15 at 20:38
  • I did. No go on categorical or missing data – dana Jan 04 '15 at 20:39
  • 1
    For future reference, when people on SO ask you for data and code, they generally mean a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). That is, a minimal example that includes data and code sufficient to either reproduce your problem or demonstrate what you're looking to achieve, and that shows what you've tried so far. – eipi10 Jan 04 '15 at 20:41
  • Sorry, I am trying to provide that. No idea how to provide you with hundreds of data observations – dana Jan 04 '15 at 20:46
  • The goal isn't to provide *all* your data. Just the smallest number of columns and rows necessary to demonstrate your problem or what you're looking for. – eipi10 Jan 04 '15 at 20:49
  • For example, if ten rows and five columns (the outcome variable plus four predictor variables) will get the job done, you can paste in the output of `dput(alldata[1:10, 1:5]))`. That's just an example. Just provide a small portion of your data set and some code that will give us something to work with. – eipi10 Jan 04 '15 at 20:52
  • Thanks for providing some data. I see that most of the rows are missing all values for the `esq` columns. Can you provide just some rows for which the outcome variable is not missing? – eipi10 Jan 04 '15 at 21:26

1 Answers1

7

Here's an example that might help you. I'm using the built-in airquality data set and the example provided in the help for ctree:

library(partykit)
# For the sctest function to extract p-values (see help for ctree and sctest)
library(strucchange)

# Data we'll use
airq <- subset(airquality, !is.na(Ozone))

# Build the tree
airct <- ctree(Ozone ~ ., data = airq)

Look at the tree:

airct
Model formula:
Ozone ~ Solar.R + Wind + Temp + Month + Day

Fitted party:
[1] root
|   [2] Temp <= 82
|   |   [3] Wind <= 6.9: 55.600 (n = 10, err = 21946.4)
|   |   [4] Wind > 6.9
|   |   |   [5] Temp <= 77: 18.479 (n = 48, err = 3956.0)
|   |   |   [6] Temp > 77: 31.143 (n = 21, err = 4620.6)
|   [7] Temp > 82
|   |   [8] Wind <= 10.3: 81.633 (n = 30, err = 15119.0)
|   |   [9] Wind > 10.3: 48.714 (n = 7, err = 1183.4)

Extract the p-values:

sctest(airct)

$`1`
              Solar.R         Wind         Temp     Month        Day
statistic 13.34761286 4.161370e+01 5.608632e+01 3.1126596 0.02011554
p.value    0.00129309 5.560572e-10 3.468337e-13 0.3325881 0.99998175

$`2`
            Solar.R         Wind         Temp     Month      Day
statistic 5.4095322 12.968549828 11.298951405 0.2148961 2.970294
p.value   0.0962041  0.001582833  0.003871534 0.9941976 0.357956

$`3`
NULL

$`4`
              Solar.R     Wind         Temp      Month       Day
statistic 9.547191843 2.307676 11.598966936 0.06604893 0.2513143
p.value   0.009972755 0.497949  0.003295072 0.99965679 0.9916670

$`5`
             Solar.R      Wind      Temp     Month       Day
statistic 6.14094026 1.3865355 1.9986304 0.8268341 1.3580462
p.value   0.06432172 0.7447599 0.5753799 0.8952749 0.7528481

$`6`
            Solar.R       Wind      Temp    Month       Day
statistic 5.1824354 0.02060939 0.9270013 0.165171 4.6220522
p.value   0.1089932 0.99998062 0.8705785 0.996871 0.1481643

$`7`
            Solar.R         Wind       Temp     Month        Day
statistic 0.8083249 11.711564549 6.77148538 0.1307643 0.03992875
p.value   0.8996614  0.003101788 0.04546281 0.9982052 0.99990034

$`8`
            Solar.R      Wind      Temp       Month         Day
statistic 0.9056479 3.1585094 2.9285252 0.008106707 0.008686293
p.value   0.8759687 0.3247585 0.3657072 0.999998099 0.999997742

$`9`
NULL
Max Ghenis
  • 14,783
  • 16
  • 84
  • 132
eipi10
  • 91,525
  • 24
  • 209
  • 285
  • Thanks. I can't get it to work with "tree". It says missing data – dana Jan 04 '15 at 21:00
  • esqtree <- ctree(esq_sick2 ~ esq_01_bin + esq_02_bin + esq_03_bin + esq_04_bin + esq_05_bin + esq_06_bin + esq_07_bin + esq_08_bin + esq_09_bin + esq_10_bin + esq_11_bin + esq_12_bin + esq_13_bin + esq_14_bin + esq_15_bin + esq_16_bin + esq_17_bin + esq_18_bin + esq_19_bin + esq_20_bin, data=alldata) Error in model@dpp(...) : missing values in response variable not allowed – dana Jan 04 '15 at 21:03
  • @dana, read the error message... You have missing values in your response variable – David Arenburg Jan 04 '15 at 21:04
  • @eipi10, that `strucchange` package is quite interesting. I'm using `ctree` a lot in my daya to day work, and never heard of this package before. – David Arenburg Jan 04 '15 at 21:07
  • Yes, I am aware I have missing data. That is correct – dana Jan 04 '15 at 21:13
  • @dana You should exclude rows with a missing response. You can't include those rows in the model-fitting function anyway. – eipi10 Jan 04 '15 at 21:38
  • rpart excludes them for you – dana Jan 04 '15 at 22:14
  • I was able to make a tree using ctree and it does not produce the same tree as rpart so i have no idea which is the correct one now – dana Jan 04 '15 at 22:43
  • They're probably using different algorithms or different portions of your data (depending on how they deal with missing values) or both. There's no "correct" tree, just trees that are optimal under given criteria. If you're doing prediction, you might try a few different methods and tuning parameters and pick the best-performing one for your data using cross-validation. – eipi10 Jan 04 '15 at 22:46
  • @dana, see [here](http://stats.stackexchange.com/questions/22035/party-vs-rpart-vs-for-partitioning-trees-in-r) and [here](http://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees) – David Arenburg Jan 04 '15 at 23:31
  • Thank you David. I found that earlier and it answered my question. rpart has no p-values. – dana Jan 06 '15 at 00:53