How to identify sequences within each leaf from a regression tree?

Question

Using the biofam dataset

library(TraMineR)
data(biofam)
lab <- c("P","L","M","LM","C","LC","LMC","D")
biofam.seq <- seqdef(biofam[,10:25], states=lab)
head(biofam.seq)

 Sequence                                    
1167 P-P-P-P-P-P-P-P-P-LM-LMC-LMC-LMC-LMC-LMC-LMC
514  P-L-L-L-L-L-L-L-L-L-L-LM-LMC-LMC-LMC-LMC    
1013 P-P-P-P-P-P-P-L-L-L-L-L-LM-LMC-LMC-LMC      
275  P-P-P-P-P-L-L-L-L-L-L-L-L-L-L-L             
2580 P-P-P-P-P-L-L-L-L-L-L-L-L-LMC-LMC-LMC       
773  P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P

I can fit and display a regression tree:

seqt <- seqtree(biofam.seq~sex + birthyr, data=biofam)

seqtreedisplay(seqt, type="I", border=NA, withlegend= TRUE, legend.fontsize=2, legendtext = "Biofam Regression Tree")

Then I can identify the leaf memberships:

seqt$fitted[,1]

This, however, is where I get confused. How do I know which leaf number corresponds to which leaf in the plot? The graph does not seem to display it, and running print(seqt) does not seem to give leaf numbers either.

What I would like to achieve is to separate out the sequences in each leaf, so that I can run descriptives on each leaf separately. How can I accomplish this?

score 3 · Answer 1 · answered Oct 30 '14 at 07:46

Currently, this information cannot easily be recovered from the tree object. The following function returns a factor of fitted values, labeled with the full set of split conditions leading to each leaf instead of the node id.

dtlabels <- function(tree){
    if (!inherits(tree, "disstree")) {
        stop("tree should be a disstree object")
    }

    split_s <- function(sp){
        formd <- function (x){
            return(format(x, digits =getOption("digits")-2))
        }
        str_split <- character(2)
        vname <- colnames(tree$data)[sp$varindex]
        if (!is.null(sp$breaks)) {
            str_split[1] <- paste("<=", formd(sp$breaks))
            str_split[2] <- paste(">", formd(sp$breaks))
        }
        else {
            str_split[1] <- paste0("[", paste(sp$labels[sp$index==1], collapse=", "),"]")
            str_split[2] <- paste0("[", paste(sp$labels[sp$index==2], collapse=", "),"]")
        }
        if(!is.null(sp$naGroup)){
            str_split[sp$naGroup] <- paste(str_split[sp$naGroup], "with NA")
        }
        return(paste(vname, str_split))
    }
    labelEnv <- new.env()
    labelEnv$label <- list()
    addLabel <- function(n1, n2, val){
        id1 <- as.character(n1$id)
        id2 <- as.character(n2$id)
        labelEnv$label[[id2]] <- c(labelEnv$label[[id1]], val)
    }
    nodeRec <- function(node){
        if(!is.null(node$split)){
            spl <- split_s(node$split)
            addLabel(node, node$kids[[1]], spl[1])
            addLabel(node, node$kids[[2]], spl[2])
            nodeRec(node$kids[[1]])
            nodeRec(node$kids[[2]])
        }
    }
    nodeRec(tree$root)
    l2 <- list()
    for(nn in names(labelEnv$label)){
        l2[[nn]] <- paste0(labelEnv$label[[nn]], collapse=" & ")
    }
    l3 <- as.character(l2)
    names(l3) <- names(l2)
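    # Map the node ids in tree$fitted to their rule labels; droplevels()
    # removes the labels of internal nodes, which contain no fitted cases.
    return(droplevels(factor(tree$fitted[, 1], levels=as.numeric(names(l3)), labels=l3)))
}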

This function can then be used in the following manner.

fitted <- dtlabels(seqt)
print(table(fitted))
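
To get from leaf memberships to the per-leaf descriptives asked about in the question, one possible approach is to use the membership vector as a grouping factor. The sketch below is only an illustration: it rebuilds the objects from the question so that it runs on its own, it groups on the raw node ids in seqt$fitted[, 1] (the factor returned by dtlabels(seqt) could be substituted to get readable labels), and the names leaf, leaf.idx and leaf.seqs are made up for the example.

```r
library(TraMineR)

# Rebuild the objects from the question.
data(biofam)
lab <- c("P","L","M","LM","C","LC","LMC","D")
biofam.seq <- seqdef(biofam[,10:25], states=lab)
seqt <- seqtree(biofam.seq ~ sex + birthyr, data=biofam)

# Leaf membership as a factor; replace with dtlabels(seqt) for rule labels.
leaf <- factor(seqt$fitted[, 1])

# Row indices of the sequences that fall into each leaf.
leaf.idx <- split(seq_len(nrow(biofam.seq)), leaf)

# One state sequence object per leaf.
leaf.seqs <- lapply(leaf.idx, function(i) biofam.seq[i, ])

# Example descriptive: mean time spent in each state, leaf by leaf.
lapply(leaf.seqs, seqmeant)

# State distribution plots, one panel per leaf.
seqdplot(biofam.seq, group = leaf)
```

Each element of leaf.seqs is an ordinary state sequence object, so other TraMineR functions can be applied to it in the same way.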

Hope this helps!

score 2 · Answer 2 · answered Oct 23 '14 at 13:03

Actually, you are looking for the rules defined by the tree. You can read them off the tree plot.

For example, the leftmost branch of your example seqt defines the rule:

birthyr <= 1940 & birthyr <= 1928

and the bottom leftmost leaf is defined by

birthyr <= 1940 & birthyr > 1928 & sex == "man"

I am afraid, however, that you are right. The disstree object returned by TraMineR (your seqt) does not currently contain that information explicitly. Perhaps it will in a future version.
