I'm trying to replicate the procedure proposed here on my data but I get the following error:

Error in interval.numeric(x, breaks = c(xmin - tol, ux, xmax)) : 
  invalid number of intervals

target is the categorical variable that I want to predict, and I would like to force the first split of the classification tree to be made on split.variable (also categorical). This reflects the nature of the objects: if split.variable is 1, target can only be 1, while if it is 0, target can be either 0 or 1. Initially I treated both as factors, but then I changed them to numeric and rounded them (as suggested in other posts on SO). Unfortunately, neither approach helped. I also experimented with the data, subsampling columns and rows, but it still doesn't work. What am I missing?

Here is an MRE to replicate the error:

library(partykit)

tdf = structure(list(target = c(0, 0, 0, 1, 0, 0, 1, 1, 1, 1), split.variable = c(0, 
0, 0, 0, 1, 0, 0, 0, 0, 0), var1 = c(2.021, 1.882, 1.633, 3.917, 
2.134, 1.496, 1.048, 1.552, 1.65, 3.112), var2 = c(97.979, 98.118, 
98.367, 96.083, 97.866, 98.504, 98.952, 98.448, 98.35, 96.888
), var3 = c(1, 1, 1, 0.98, 1, 1, 1, 1, 1, 1), var4 = c(1, 1, 
1, 0.98, 1, 1, 1, 1, 1, 1), var5 = c(18.028, 25.207, 20.788, 
28.548, 18.854, 19.984, 27.352, 24.622, 25.037, 24.067), var6 = c(0.213, 
0.244, 0.289, 0.26, 0.887, 0.575, 0.097, 0.054, 0.104, 0.096), 
    var7 = c(63.22, 59.845, 62.45, 63.48, 52.143, 51.256, 56.296, 
    57.494, 59.543, 68.434), var8 = c(0.748, 0.795, 0.807, 0.793, 
    0.901, 0.909, 0.611, 0.61, 0.618, 0.589)), row.names = c(6L, 
7L, 8L, 9L, 11L, 12L, 15L, 16L, 17L, 18L), class = "data.frame")

tr1 <- ctree(target ~ split.variable,     data = tdf, maxdepth = 1)
tr2 <- ctree(target ~ split.variable + ., data = tdf, subset = predict(tr1, type = "node") == 2)
Nico

1 Answer

Your data set is too small to do what you want:

  • With just 10 observations tr1 does not lead to any splits but produces a tree with a single root node.
  • Consequently, predict(tr1, type = "node") returns a vector of ten 1s, i.e., every observation is assigned to node 1.
  • Thus, the subset with predict(tr1, type = "node") == 2 is empty (all FALSE).
  • This leads to an (admittedly cryptic) error message, reflecting that you cannot learn a tree from an empty data set.
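
To see this concretely, here is a minimal sketch that inspects the node ids before subsetting. It assumes only the two relevant columns of the question's data and the default ctree() settings; the names nodes and in_node2 are just illustrative.

```r
library(partykit)

# The two relevant columns from the question's data
tdf <- data.frame(
  target = c(0, 0, 0, 1, 0, 0, 1, 1, 1, 1),
  split.variable = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
)

tr1 <- ctree(target ~ split.variable, data = tdf, maxdepth = 1)

# Ids of the terminal nodes: a tree without splits only has the root
nodeids(tr1, terminal = TRUE)

# Node id assigned to each observation
nodes <- predict(tr1, type = "node")
table(nodes)

# Check the subset before passing it on to a second ctree() call
in_node2 <- nodes == 2
if (!any(in_node2)) {
  message("No observations in node 2, so there is nothing to fit the second tree on.")
}
```

Checking table(nodes) first makes it obvious whether the first tree produced the split that the second call relies on.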

Additionally: I'm not sure where you found the recommendation to use numeric codings of categorical variables. But for partykit you are almost always better off coding categorical variables appropriately as factor variables.
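
As a sketch of the factor coding, again assuming just the two relevant columns from the question (the factor levels simply reuse the 0/1 codes):

```r
library(partykit)

tdf <- data.frame(
  target = c(0, 0, 0, 1, 0, 0, 1, 1, 1, 1),
  split.variable = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
)

# Recode both categorical variables as factors instead of numeric 0/1
tdf$target <- factor(tdf$target, levels = c(0, 1))
tdf$split.variable <- factor(tdf$split.variable, levels = c(0, 1))

str(tdf)

# With a factor response, ctree() fits a classification tree
tr <- ctree(target ~ split.variable, data = tdf)

# Predictions are then factor levels rather than numbers
predict(tr, type = "response")
```

With only these 10 rows the tree still has no splits, but the response is now treated as a class label rather than as a number.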

Achim Zeileis
  • The numeric coding comes from a post whose link I have lost (I'll check whether I can find it again). By the way, my fault: there was a sneaky error in the main dataset that ruined everything. Unfortunately, I got stuck again at the "last" step. If you can, please see: https://stackoverflow.com/questions/74476666/error-in-kids-nodenodei-subscript-out-of-bounds-in-partykit – Nico Nov 17 '22 at 13:54
  • Please accept this answer (by clicking on the check mark on the left of it) so that it is flagged as "resolved" here on StackOverflow. I'll try to have a look at your other post in the next days. – Achim Zeileis Nov 17 '22 at 16:15