I'm using the R package TraMineR for academic research on sequence analysis.

I want to find a pattern defined as someone being in the target company, then going out, then coming back to the target company.

(Simplified) I've defined state A as the target company, B as a company outside the industry, and C as a company inside the industry.

So what I want to do is find sequences with the specific patterns A-B-A or A-C-A.

After looking at this question (Strange number of subsequences?) and reading the user guide, especially the following passages:

4.3.3 Subsequences. A sequence u is a subsequence of x if all successive elements u_i of u appear in x in the same order, which we simply denote by u ⊂ x. According to this definition, unshared states can appear between those common to both sequences u and x. For example, u = S; M is a subsequence of x = S; U; M; MC.

and

7.3.2 Finding sequences with a given subsequence. The seqpm() function counts the number of sequences that contain a given subsequence and collects their row index numbers. The function returns a list with two elements. The first element, MTab, is just a table with the number of occurrences of the given subsequence in the data. Note that only one occurrence is counted per sequence, even when the sub-sequence appears more than one time in the sequence. The second element of the list, MIndex, gives the row index numbers of the sequences containing the subsequence. These index numbers may be useful for accessing the concerned sequences (example below). Since it is easier to search a pattern in a character string, the function first translates the sequence data in this format when using the seqconc function with the TRUE option.

I concluded that seqpm() was the function I needed to get the job done.

So I have sequences like: A-A-A-A-A-B-B-B-B-B-A-A-A-A-A

Based on the definition of subsequences I found in the sources mentioned above, I figured I could find that kind of sequence by using:

seqpm(sequence,"ABA")

But that does not work. To find that example sequence, I need to enter

seqpm(sequence,"ABBBBBA")

which is not very useful for what I need.

  1. So do you see where I might have missed something?
  2. How can I retrieve all the sequences that go from A to B and back to A?
  3. Is there a way to find sequences that go from A to anything else and then back to A?

Thanks a lot !

Pedro Braz

1 Answer

The title of the seqpm help page is "Find substring patterns in sequences", and this is what the function actually does: it searches for sequences that contain a given substring (not a subsequence), so the states in the pattern must be contiguous. There seems to be a wording error in the user's guide.
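
To make the substring/subsequence distinction concrete, here is a minimal base-R sketch. It does not use TraMineR: it assumes each sequence is stored as a plain string with one character per state (the `seqs` vector is made up for illustration), and it uses regular expressions as a stand-in for a subsequence search.

```r
# Hypothetical sequences, one character per state (illustration only)
seqs <- c(s1 = "AAAAABBBBBAAAAA",
          s2 = "AAAAACCCCCAAAAA",
          s3 = "AAAAABBBBBBBBBB",
          s4 = "AABCAAAAAAAAAAA")

# Substring search: "ABA" must appear as three consecutive characters
has_substring <- grepl("ABA", seqs, fixed = TRUE)

# Subsequence search: A, then B, then A, with anything in between
has_aba <- grepl("A.*B.*A", seqs)

# A, then at least one non-A state, then A again
has_a_other_a <- grepl("A[^A]+A", seqs)

data.frame(seq = seqs, has_substring, has_aba, has_a_other_a)
```

None of the toy strings contains "ABA" as a contiguous substring, which mirrors the behaviour described in the question, while the regex versions flag the strings that leave A and later return to it.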

One way to find the sequences that contain given subsequences is to convert the state sequences into event sequences with seqecreate, and then use the seqefsub and seqeapplysub functions. I illustrate using the actcal data that ships with TraMineR.

library(TraMineR)
data(actcal)
actcal.seq <- seqdef(actcal[,13:24])

## displaying the first state sequences
head(actcal.seq)

## transforming into event sequences
actcal.seqe <- seqecreate(actcal.seq, tevent = "state", use.labels=FALSE)

## displaying the first event sequences
head(actcal.seqe)

## now searching for the subsequences
subs <- seqefsub(actcal.seqe, strsubseq=c("(A)-(D)","(D)-(B)"))
## and identifying the sequences that contain the subsequences
subs.pres <- seqeapplysub(subs, method="presence")
head(subs.pres)

## we can now, for example, count the sequences that contain (A)-(D)
sum(subs.pres[,1])
## or list the sequences that contain (A)-(D)
rownames(subs.pres)[subs.pres[,1]==1]
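
The same steps can be applied to the A-B-A and A-C-A patterns from the question. The following is only a sketch under assumptions: the `toy` data frame is invented for illustration, and the argument name `strsubseq` is taken from the code above (recent TraMineR releases appear to document it as `str.subseq`, so check `?seqefsub` for your version).

```r
library(TraMineR)

# Hypothetical toy data: one row per person, one column per time point
toy <- data.frame(
  t1 = c("A", "A", "A"),
  t2 = c("A", "C", "B"),
  t3 = c("B", "C", "B"),
  t4 = c("B", "A", "B"),
  t5 = c("A", "A", "C")
)
toy.seq  <- seqdef(toy)
toy.seqe <- seqecreate(toy.seq, tevent = "state", use.labels = FALSE)

# Leaving the target company (A) and coming back, via B or via C
subs <- seqefsub(toy.seqe, strsubseq = c("(A)-(B)-(A)", "(A)-(C)-(A)"))
pres <- seqeapplysub(subs, method = "presence")

# Row positions of sequences that contain at least one of the two patterns
which(rowSums(pres) > 0)
```

Because the event sequences only record state changes, a long spell such as A-A-A-B-B-B-A-A should reduce to the events A, B, A, which is why the pattern does not need to spell out every repeated state.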

Hope this helps.

Gilbert
  • Oh, I wish they had a tool for making that search directly on state sequences. I'll email them about this suggestion and the confusion in the user guide. Thanks, it was very helpful! – Pedro Braz Jan 23 '15 at 16:32
  • I'm having a rough time understanding the format of what seqeapplysub() returns. My problem is that when I check the subs object it says count=6, but when I check the subs.pres object it prints a bunch of lines (many more than the count). I changed this line of code: subs <- seqefsub(actcal.seqe, strsubseq=c("(A)-(D)-(B)")) – Pedro Braz Jan 23 '15 at 19:08
  • `seqeapplysub` returns a matrix with a row for each original sequence and a column for each subsequence you are searching for (there are two in my example above). The 1's in the column indicate sequences that contain the corresponding subsequence, and 0's those that don't. So the count returned by `seqefsub` (6, in your case) corresponds to the number of 1's in the column. – Gilbert Jan 23 '15 at 19:30
  • Thanks! Is there a way to get their indexes? – Pedro Braz Jan 23 '15 at 19:32
  • Just use `which`. For example, `index <- which(subs.pres[,1]==1)` – Gilbert Jan 24 '15 at 07:21
  • I know this question is pretty old, but maybe you could help me. When I use seqefsub, it returns a list of sequences in which the subsequence was searched. In my case that number is far smaller than the total number of sequences. Can you think of any reason for that? – Pedro Braz Aug 26 '16 at 20:08
  • My point is, now that I have the indexes of the subsequences, I want to look them up in the original dataset. But the two are not the same length. – Pedro Braz Aug 26 '16 at 20:12
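
The matrix layout described in the comments can be mimicked with a small base-R example. This is a sketch with made-up values, not TraMineR output: `pres` stands in for the result of `seqeapplysub(..., method = "presence")`, and `orig` is a hypothetical original data set assumed to have one row per sequence in the same order.

```r
# Hypothetical presence matrix: 5 sequences (rows) x 2 subsequences (columns)
pres <- matrix(c(1, 0,
                 0, 0,
                 1, 1,
                 0, 1,
                 1, 0),
               ncol = 2, byrow = TRUE,
               dimnames = list(paste0("id", 1:5), c("(A)-(D)", "(D)-(B)")))

# Number of sequences containing each subsequence
colSums(pres)

# Row positions of the sequences containing the first subsequence
idx <- which(pres[, 1] == 1)

# Hypothetical original data with one row per sequence, in the same order
orig <- data.frame(id = paste0("id", 1:5), group = c("x", "y", "x", "y", "x"))
orig[idx, ]
```

The row positions in `idx` only map back to the original data if both objects have the same number of rows in the same order, so comparing `nrow()` of the two is a sensible first check when the lengths seem to differ.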