I need to plot a conditional inference tree. I have selected the party::ctree() function. It works on the iris dataset.

library(party)
(irisct_party <- party::ctree(Species ~ .,data = iris))
plot(irisct_party)

But when I use random data:

library(wakefield)
set.seed(123)
n=200
studs <- data.frame(problem = factor(answer(n, x = c("No", "Yes"))),
                    age     = round(runif(n, 18, 25)),
                    gender  = factor(answer(n, x = c("M",   "F" ))),
                    smoker  = factor(answer(n, x = c("No",  "Yes" ))),
                    before  = round(runif(n, 60, 80))
)
# data.frame() cannot refer to a column created in the same call, so add `after` in a second step
studs$after <- studs$before + round(runif(n, 10, 20))

(ct <-  party::ctree(problem ~ ., data = studs))
plot(ct)

I see only this:

Conditional inference tree with 1 terminal nodes

Response:  problem 
Inputs:  age, gender, smoker, before, after 
Number of observations:  200 

1)*  weights = 200 

Question: Why does the conditional inference tree have only 1 terminal node on random data?

Nick
  • The `party` function `ctree` can find a lot of structure, but only if there are patterns to find. To see what I mean, you could fit something like `randomForest::randomForest` and look at its performance. For the `iris` data, the fit is around 95% explained; for your random data, it is closer to 50%. It's a conditional inference tree, but it wasn't able to find conditional inferences that suitably represent your data. Does that make sense? – Kat Feb 17 '22 at 15:36

1 Answer

In each node (including the root node), ctree() conducts an independence test between the dependent variable (problem in your random data) and each of the explanatory variables (age, gender, smoker, before, after). It computes the p-value for each of the tests and selects the explanatory variable with the lowest p-value for splitting, but only if that p-value is significant at a given significance level (adjusted for testing multiple explanatory variables). In your data this is not the case because the dependent variable was, in fact, sampled independently of the explanatory ones. Therefore, the algorithm stops and does not split the root node.
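
To make the root-node check concrete, here is a rough sketch of the same kind of test run by hand. It assumes the coin package is installed and uses a hypothetical stand-in data frame `d` built with base R instead of wakefield. The tests and the plain Bonferroni adjustment approximate what ctree() does internally and are not guaranteed to match its numbers exactly.

```r
library(coin)  # assumption: coin is installed; it provides independence_test() and pvalue()

set.seed(123)
n <- 200
# Hypothetical stand-in for the question's data: the response is drawn independently of all inputs
d <- data.frame(
  problem = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  age     = round(runif(n, 18, 25)),
  gender  = factor(sample(c("M", "F"), n, replace = TRUE)),
  smoker  = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  before  = round(runif(n, 60, 80))
)
d$after <- d$before + round(runif(n, 10, 20))

inputs <- setdiff(names(d), "problem")

# One independence test per explanatory variable, using a quadratic test statistic
p_raw <- sapply(inputs, function(v) {
  f <- as.formula(paste("problem ~", v))
  as.numeric(pvalue(independence_test(f, data = d, teststat = "quadratic")))
})

# Adjust for testing several variables at once (plain Bonferroni here)
p_adj <- p.adjust(p_raw, method = "bonferroni")
print(round(cbind(raw = p_raw, adjusted = p_adj), 3))

# A split is only attempted if the smallest adjusted p-value falls below the level (0.05 here)
min(p_adj) < 0.05
```

If the last line returns FALSE, no variable passes the test and the root node is left unsplit, which is the situation described above.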

Remarks: It is recommended to use the successor package partykit rather than party for fitting ctree(). See also the accompanying vignette("ctree", package = "partykit") for further details.
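
As a minimal sketch of that switch, assuming partykit is installed, the iris example from the question can be refit as follows. The `alpha` argument of `ctree_control()` is the significance level for the split test; 0.05 is shown explicitly only for illustration.

```r
library(partykit)  # assumption: partykit is installed

# Same model as in the question, fitted with the partykit implementation
irisct <- partykit::ctree(Species ~ ., data = iris,
                          control = ctree_control(alpha = 0.05))
print(irisct)
plot(irisct)

# IDs of the terminal nodes of the fitted tree
terminal_ids <- nodeids(irisct, terminal = TRUE)
terminal_ids
```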

Achim Zeileis
  • Thank you for the answer. I have read your paper https://www.tandfonline.com/doi/abs/10.1198/000313006X118430 and am looking into how to generate dependent data. – Nick Feb 18 '22 at 03:24
  • I have returned to the iris tree and see that Petal.Length was used twice. Is this typical for the ctree algorithm, that one variable is used two times? – Nick Feb 23 '22 at 05:37
  • It's not unusual. As the algorithm has constant fits in the terminal nodes, it often fits piecewise-constant step functions to approximate nonlinear relationships. – Achim Zeileis Feb 23 '22 at 08:46
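
Following up on the comment about generating dependent data, here is one possible sketch, assuming partykit is installed. The logistic link between `age` and `problem` and the column name `noise` are made up for illustration; any mechanism that ties the response to an input would serve the same purpose.

```r
library(partykit)  # assumption: partykit is installed

set.seed(123)
n <- 200
age <- round(runif(n, 18, 25))

# Hypothetical dependence: the probability of "Yes" rises steeply with age
p_yes   <- plogis(1.0 * (age - 21.5))
problem <- factor(ifelse(runif(n) < p_yes, "Yes", "No"), levels = c("No", "Yes"))

dep <- data.frame(problem = problem,
                  age     = age,
                  noise   = round(runif(n, 60, 80)))  # unrelated input, for contrast

ct_dep <- partykit::ctree(problem ~ ., data = dep)
print(ct_dep)

# Number of terminal nodes; more than one means the root node was split
n_terminal <- length(nodeids(ct_dep, terminal = TRUE))
n_terminal
```

Because the response now depends on `age`, the independence test at the root should be rejected and the tree should split. A smooth relationship like this one may also lead to the same variable being split more than once, in line with the step-function remark above.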