I'm trying to replicate the procedure proposed here on my data. `target` is the categorical variable that I want to predict, and I would like to force the first split of the classification tree to be made on `split.variable` (also categorical). Because of the nature of the objects, if `split.variable` is 1 then `target` can only be 1, whereas if it is 0, `target` can be 0 or 1. This leads to:

    > table(training_set$target, training_set$split.variable)

         0  1
      0 69  0
      1 59 56

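To make the structure of the data explicit, here is a minimal sketch in base R. The data frame `toy` is hypothetical: it is not my real `training_set`, only a stand-in built so that its cross-tabulation has the same counts as the table above. It counts how many distinct `target` classes occur within each level of `split.variable`:

```r
# Hypothetical toy data whose cross-tab mirrors the counts in the table above
toy <- data.frame(
  target = factor(c(rep(0, 69), rep(1, 59), rep(1, 56)), levels = c(0, 1)),
  split.variable = factor(c(rep(0, 128), rep(1, 56)), levels = c(0, 1))
)

# Rows are target, columns are split.variable
tab <- table(toy$target, toy$split.variable)
print(tab)

# Number of distinct target classes observed within each level of split.variable
n_classes <- tapply(toy$target, toy$split.variable,
                    function(y) length(unique(y)))
print(n_classes)
```

With these counts, the `split.variable == 1` branch holds a single class, so there is nothing left for a tree to separate there.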
I'm able to create `tr1` and `tr2`. `tr3` returns the error below because, if I'm correct, that branch is "empty", so there is no need for it (see also this post):

    Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

    tr1 <- ctree(target ~ split.variable, data = training_set, maxdepth = 1)  # create the first split at split.variable
    tr2 <- ctree(target ~ split.variable + ., data = training_set,            # then the left branch...
      subset = predict(tr1, type = "node") == 2)

    fix_ids <- function(x, startid = 1L) {
      id <- startid - 1L
      new_node <- function(x) {
        id <<- id + 1L
        if(is.terminal(x)) return(partynode(id, info = info_node(x)))
        partynode(id,
          split = split_node(x),
          kids = lapply(kids_node(x), new_node),
          surrogates = surrogates_node(x),
          info = info_node(x))
      }
      return(new_node(x))
    }

    no <- node_party(tr1)
    no$kids <- list(
      fix_ids(node_party(tr2), startid = 2L)
      #, fix_ids(node_party(tr3), startid = 5L)
    )
    no  # visualize the structure

    [1] root
    |   [2] V2 <= 1
    |   |   [3] V15 <= -2.489 *
    |   |   [4] V15 > -2.489 *

    mdf <- model.frame(target ~ split.variable + ., data = training_set)
    tr <- party(no,
      data = mdf,
      fitted = data.frame(
        "(fitted)" = fitted_node(no, data = mdf),
        "(response)" = model.response(mdf),
        check.names = FALSE),
      terms = terms(mdf), )

However, running `party(...)` gives the following error:

    Error in kids_node(node)[[i]] : subscript out of bounds

The only reference to this error that I was able to find is this GitHub issue.
Here is the traceback:

    8: is.terminal(node)
    7: fitted_node(kids_node(node)[[i]], data, vmatch, obs[indx], perm)
    6: fitted_node(no, data = mdf)
    5: data.frame(`(fitted)` = fitted_node(no, data = mdf), `(response)` = model.response(mdf),
           check.names = FALSE)
    4: party(no, data = mdf, fitted = data.frame(`(fitted)` = fitted_node(no,
           data = mdf), `(response)` = model.response(mdf), check.names = FALSE),
           terms = terms(mdf), )
    3: .is.positive.intlike(x)
    2: .traceback(x, max.lines = max.lines)
    1: traceback(party(no, data = mdf, fitted = data.frame(`(fitted)` = fitted_node(no,
           data = mdf), `(response)` = model.response(mdf), check.names = FALSE),
           terms = terms(mdf), ))

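For reference, the message itself is the generic one base R raises when `[[` indexes past the end of a list. Below is a minimal sketch, independent of partykit, using a hypothetical `kids` list with a single element. Whether this is what actually happens inside `fitted_node()` for my tree is an assumption I have not verified:

```r
# Hypothetical stand-in for a node's list of children: only one kid is present
kids <- list("left branch")

# Asking for a second child triggers the generic base R indexing error;
# tryCatch captures the message instead of stopping the script
msg <- tryCatch(kids[[2]], error = function(e) conditionMessage(e))
print(msg)
```

If the same thing applies to `no`, comparing `length(kids_node(no))` with the number of branches implied by the split might be a useful check.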
I can't tell whether this is related to the missing branch, to `mlr`, or to something else specific to my data.