Getting log-likelihood from probabilistic suffix tree

Question

Here is my code:

library(RCurl)
library(TraMineR)
library(PST)

x <- getURL("https://gist.githubusercontent.com/aronlindberg/08228977353bf6dc2edb3ec121f54a29/raw/c2539d06771317c5f4c8d3a2052a73fc485a09c6/challenge_level.csv")
data <- read.csv(text = x)

# Load and transform data
data <- read.table("thread_level.csv", sep = ",", header = F, stringsAsFactors = F)

data.seq <- seqdef(data[2:nrow(data),2:ncol(data)], missing = "NA", right = "*")

# Make a tree
S1 <- pstree(data.seq, ymin = 0.05, L = 6, lik = TRUE, with.missing = F)
logLik(S1)

For some reason, it refuses to return a Log-likelihood value? Why is this the case? How can I get a Log-likelihood value?

Gilbert · Accepted Answer · 2017-01-27T07:59:35.473

You have bad values for the missing and right arguments in your seqdef command which then causes an error in pstree.

With

data.seq <- seqdef(data[2:nrow(data),2:ncol(data)], missing = NA, right= NA, nr = "*")
# Make a tree
S1 <- pstree(data.seq, ymin = 0.05, L = 6, lik = TRUE, with.missing = TRUE)
logLik(S1)

we get

'log Lik.' -31011.32 (df=47179)

Note that since you have missing values I have set with.missing = TRUE in the pstree command.

===============

To ignore the right missings, set right='DEL' in seqdef.

seq <- seqdef(data[2:nrow(data),2:ncol(data)], missing = NA, right= "DEL")
S2 <- pstree(seq, ymin = 0.05, L = 6, lik = TRUE, with.missing = F)
logLik(S2)

I don't know what PST computes as logLik(S2) and why we get here an NA. The likelihood to generate the data with the tree S2 can be obtained by means of the predict function that returns the likelihood of each sequence in the data. The log likelihood of the data should then be

sum(log(predict(S2, seq)))

which gives

 [>] 984 sequence(s) - min/max length: 1/32
 [!] sequences have unequal lengths
 [>] max. context length: L=6
 [>] found 1020 distinct context(s)
 [>] total time: 0.588 secs
[1] -4925.79

However, all of my missing values are on the right--there's not a single missing value "inside" of the sequences. Hence, roughly 90% of all my events are `*`. Won't this skew the probabilities? — histelheim, Jan 26 '17 at 14:02
For example, when I do `cmine(S1, pmin = 0, state = "good_idea", l = 1)` I get a probability of `e` that is 47%--this should not even be possible with this dataset, since no sequence starts with `*`. — histelheim, Jan 26 '17 at 16:59
I edited the answer to show how to ignore the missing end of the sequences. And, your are right, we get then an `NA` logLik. I suggest you ask the author of the package about what this `logLik` is supposed to be (it does not return the same value as the method I propose in the answer!). — Gilbert, Jan 27 '17 at 08:07

score 2 · Answer 2 · edited Feb 02 '17 at 17:37

Indeed, there was a problem when computing likelihood of models fitted to sequences of unequal lengths. This is fixed. The new version of the PST package (0.94) will be available within a few hours on R-Forge, to install:

install.packages("PST", repos="http://R-Forge.R-project.org")

and later on CRAN.

Note that since your sequences don't contain any missing values but are of unequal lengths, you don't have to set neither with.missing=TRUE when using the pstree function nor any option when using seqdef.

Now when running the following code:

library(RCurl)
library(TraMineR)
library(PST)

x <- getURL("https://gist.githubusercontent.com/aronlindberg/08228977353bf6dc2edb3ec121f54a29/raw/c2539d06771317c5f4c8d3a2052a73fc485a09c6/challenge_level.csv")
data <- read.csv(text = x)

data.seq <- seqdef(data[2:nrow(data),2:ncol(data)])

# Make a tree
S1 <- pstree(data.seq, ymin = 0.05, L = 6)

I get:

> S1@logLik
[1] -4925.79

See comment in question: http://stackoverflow.com/questions/41968530/fitting-a-vlmc-to-very-long-sequences - not recognizing the missing values causes issues in `tune()` and `cmine()` - how can I get around this? — histelheim, Feb 24 '17 at 14:48

Getting log-likelihood from probabilistic suffix tree

2 Answers2