In my data, missing values (*) occur only on the right side of the sequences: no sequence starts with * and no sequence has any other state after a *. Despite this, the PST (Probabilistic Suffix Tree) seems to predict a 90% chance of starting with a *. Here's my code:

# Load libraries
library(RCurl)
library(TraMineR)
library(PST)

# Get data
x <- getURL("https://gist.githubusercontent.com/aronlindberg/08228977353bf6dc2edb3ec121f54a29/raw/c2539d06771317c5f4c8d3a2052a73fc485a09c6/challenge_level.csv")
data <- read.csv(text = x)

# Load and transform data (this overwrites the `data` object read from the URL above)
data <- read.table("thread_level.csv", sep = ",", header = FALSE, stringsAsFactors = FALSE)

# Create sequence object (drop the first row and first column; code missing values as "*")
data.seq <- seqdef(data[2:nrow(data), 2:ncol(data)], missing = NA, right = NA, nr = "*")

# Make a tree
S1 <- pstree(data.seq, ymin = 0.05, L = 6, lik = TRUE, with.missing = TRUE)

# Look at first state
cmine(S1, pmin = 0, state = "N3", l = 1)

This generates:

[>] context: e 
            EX         FA         I1         I2          I3          N1              N2          N3        NR
S1 0.006821066 0.01107234 0.01218274 0.01208756 0.006821066 0.002569797     0.003299492 0.001554569 0.0161802
           QU          TR         *
S1 0.01126269 0.006440355 0.9097081

How can the probability for * be 0.9097081 at the very beginning of the sequence, meaning after context e?

Does it mean that the context can appear anywhere inside a sequence, and that e denotes an arbitrary starting point somewhere inside a sequence?

histelheim

1 Answer

A PST is a representation of a variable length Markov chain (VLMC). Like a classical Markov model, a VLMC is assumed to be homogeneous (or stationary), meaning that the conditional probabilities of the next state given the context are the same at every position in the sequence. In other words, the context can appear anywhere in the sequence. Contexts are searched by exploring the tree, which is assumed to apply at any position in the sequences.
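As a rough illustration of what "the context can appear anywhere" means, here is a minimal base R sketch that counts which state follows a given context at every position where that context occurs. The toy sequences and the helper `next_state_counts` are hypothetical and only for illustration; this is plain counting, not how the PST package builds or prunes its tree.

```r
# Toy sequences (hypothetical, not the data from the question)
seqs <- list(
  c("EX", "FA", "I1", "EX", "FA", "I1"),
  c("I1", "EX", "FA", "N1"),
  c("EX", "FA", "I1", "N1")
)

# Count the states that follow `context`, wherever the context occurs.
# An empty context (character(0)) matches at every position.
next_state_counts <- function(seqs, context) {
  k <- length(context)
  followers <- character(0)
  for (s in seqs) {
    if (length(s) <= k) next
    for (i in seq_len(length(s) - k)) {
      # Compare the k states starting at position i with the context
      if (identical(s[i + seq_len(k) - 1], context)) {
        followers <- c(followers, s[i + k])
      }
    }
  }
  table(followers)
}

# Conditional distribution of the next state after EX-FA
counts <- next_state_counts(seqs, c("EX", "FA"))
probs <- counts / sum(counts)
print(probs)

# Empty context: every position contributes, regardless of where it is
overall <- next_state_counts(seqs, character(0))
print(overall / sum(overall))
```

In this toy data EX-FA occurs four times (twice in the first sequence), and all four occurrences contribute to the same conditional distribution. The empty context simply pools all positions of all sequences.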

In your example, with l=1 (l is 1 + the length of the context), you look only at contexts of length 0, i.e., the only possible context is the empty sequence e. Your condition pmin=0, state="N3" (N3 has a probability greater than 0) is equivalent to no condition at all. So you get the overall probability of observing each state, pooled over all positions rather than at the first position only. Since the trailing missing states are included, * gets most of the probability mass. Because your sequences (with the missing states) are all of the same length, you would get the same results using TraMineR with

seqmeant(data.seq, with.missing=TRUE)/max(seqlength(data.seq))

To get the distribution at the first position, you can use TraMineR and look at the first column of the table of cross-sectional distributions at the successive positions returned by

seqstatd(data.seq, with.missing=TRUE)
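To make the difference between the two distributions concrete, here is a small base R sketch on hypothetical toy data (a character matrix padded on the right with "*", not your data set). It assumes, as described above, that the empty context pools all positions, and contrasts that with the distribution at the first position only.

```r
# Toy sequences of equal length, padded on the right with "*" (hypothetical data)
m <- rbind(
  c("EX", "FA", "*",  "*", "*"),
  c("I1", "*",  "*",  "*", "*"),
  c("EX", "I1", "N1", "*", "*"),
  c("FA", "*",  "*",  "*", "*")
)

# Distribution pooled over all positions: the analogue of the empty context e
overall <- prop.table(table(m))

# Distribution at the first position only
first <- prop.table(table(m[, 1]))

print(overall)
print(first)
```

Here "*" dominates the pooled distribution because most cells are padding, yet it never occurs in the first column, which mirrors the pattern in the question.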

Hope this helps.

Gilbert
  • Does this mean that when I search for a context with `L>1`, `PST` does not necessarily restrict itself to the beginning of the sequence, but rather tries to find the context wherever it appears, and then show the conditional probabilities of the next state? E.g. I might look for `EX-FA`, and then `PST` would identify all instances of `EX-FA`, *no matter where they appear in the sequences*, and then give the conditional probabilities of the next state after `EX-FA`? – histelheim Jan 27 '17 at 14:10
  • Yes, it tries to find the context wherever it appears (actually it explores the tree that is supposed to apply anywhere). With `cmine` you get all the contexts (and their associated conditional probabilities of the next state) under the constraints defined with the `pmin`, `pmax`, `state`, and `l` arguments. – Gilbert Jan 27 '17 at 14:25