Automatic sequence identification in R

Question

I am working in R with some sequential data. Specifically, I have a list of integers in which the same sequences appear several times. I am trying to write code that identifies how many different sequences appear.

Currently, I am doing it manually: I predefine the patterns that exist and apply a function that counts the occurrences of each.

I first use RMySQL to run the query, whose result is stored in the variable product_process_history_joined. Then I extract the column of interest into the variable data. Next, I define the patterns my function should work on, and finally I apply the function that counts the number of occurrences.

The code:

product_process_history_joined <- dbGetQuery(con, "SELECT *
                                       FROM product, process_history
                                       WHERE product.idproduct = process_history.product_idproduct")

data <- product_process_history_joined$process_types_idprocess_types

pat <- c(1,2,4,5,6)
# Test every window of length(pat); the "+1" makes sure the final window is included
x <- sapply(1:(length(data)-length(pat)+1), function(x) all(data[x:(x+length(pat)-1)] == pat))
route<-data[which(x)]
countR<-length(route)

pat1 <- c(1,2,4,5,7,9,7,7,2,5,6,10)
x <- sapply(1:(length(data)-length(pat1)+1), function(x) all(data[x:(x+length(pat1)-1)] == pat1))
route1<-data[which(x)]
countR1<-length(route1)

The dataset that is produced and stored in the data variable looks like this:

  [1]  1  4  5  6  1  4  5  6  1  4  5  6  1  4  5  6  1  4  5  6  1  4  5  6  1  4  5  6  1  4  5  6  1  4  5
 [36]  6  1  4  5  6  1  4  5  6  1  4  5  6  1  4  5  6  1  4  5  6  1  4  5  6  1  4  5  6  1  4  5  6  1  4
 [71]  5  6  1  4  5  6  1  4  5  6  1  4  5  6  1  2  4  5  6 10  1  2  4  5  7  9  7  7  2  5  6 10  1  2  4
[106]  5  6 10  1  2  4  5  6 10  1  2  4  8  1  2  3  5  7  8  1  2  3  5  6  1  2  3  5  6  1  2  4  5  6 10

This is just a subset of the list. I use around 12 different patterns. The results for the first 2 patterns in the given dataset are 21 for pat and 1 for pat1.
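
With around 12 patterns, the copy-pasted block above could be wrapped in a helper and applied to a list of patterns. The following is only a minimal sketch: `count_pattern` and `patterns` are hypothetical names that do not appear in the original code, and the `data` vector is just the last 56 values of the output shown above, not the full query result.

```r
# Hypothetical helper: count the windows of `x` that equal `pattern` exactly.
count_pattern <- function(x, pattern) {
  n_windows <- length(x) - length(pattern) + 1
  if (n_windows < 1) return(0L)
  hits <- vapply(
    seq_len(n_windows),
    function(i) all(x[i:(i + length(pattern) - 1)] == pattern),
    logical(1)
  )
  sum(hits)
}

# Illustrative data: elements 85 to 140 of the printed output above.
data <- c(1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10,
          1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 6, 10, 1, 2, 4, 8,
          1, 2, 3, 5, 7, 8, 1, 2, 3, 5, 6, 1, 2, 3, 5, 6,
          1, 2, 4, 5, 6, 10)

# Hypothetical named list holding the patterns of interest.
patterns <- list(
  pat  = c(1, 2, 4, 5, 6),
  pat1 = c(1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10)
)

# One count per pattern, named after the list entries.
counts <- vapply(patterns, count_pattern, integer(1), x = data)
counts
```

Adding a thirteenth pattern then means adding one entry to the list rather than another block of code.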

Please create a minimal and reproducible example: http://stackoverflow.com/a/5963610/1412059 Your database query is irrelevant for the issue at hand. — Roland, Jan 25 '16 at 12:42
Perhaps see, also, [this -probably similar- question](http://stackoverflow.com/questions/33027611/how-to-index-a-vector-sequence-within-a-vector-sequence) — alexis_laz, Jan 25 '16 at 13:41

Roland · Accepted Answer · 2016-01-25T18:48:30.387

4

There is no reason for regexing. You could use rollapply:

original_data <- c(rep(c(1, 4, 5, 6), 21),  # 21 repetitions of 1, 4, 5, 6
                   1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10,
                   1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 6, 10, 1, 2, 4, 8,
                   1, 2, 3, 5, 7, 8, 1, 2, 3, 5, 6, 1, 2, 3, 5, 6,
                   1, 2, 4, 5, 6, 10)

pattern2 <- c(1, 4, 5, 6)

library(zoo)

sum(
  rollapply(
    original_data, 
    width = length(pattern2), 
    FUN = function(x, pattern) all(x == pattern), 
    pattern = pattern2
  )
)
#[1] 21

Faster solutions are possible if necessary, but this offers good readability.
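
One possible direction for a faster version, sketched here under assumptions rather than taken from the answer: compare the pattern one element at a time against shifted slices of the data, so the work is a handful of vectorised comparisons instead of one function call per window. `count_pattern_fast` is a hypothetical name, and whether it is actually faster on your data should be checked with a benchmark.

```r
# Hypothetical helper: vectorised sliding-window match in base R.
count_pattern_fast <- function(x, pattern) {
  n <- length(x)
  k <- length(pattern)
  if (k == 0L || k > n) return(0L)
  # hits[i] stays TRUE only if the window starting at i matches so far
  hits <- rep(TRUE, n - k + 1)
  for (j in seq_len(k)) {
    # Slice j holds the j-th element of every window
    hits <- hits & (x[j:(n - k + j)] == pattern[j])
  }
  sum(hits)
}

original_data <- c(rep(c(1, 4, 5, 6), 21),
                   1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10,
                   1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 6, 10, 1, 2, 4, 8,
                   1, 2, 3, 5, 7, 8, 1, 2, 3, 5, 6, 1, 2, 3, 5, 6,
                   1, 2, 4, 5, 6, 10)

count_pattern_fast(original_data, c(1, 4, 5, 6))
#[1] 21
```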

Edit

This extracts all different sequences that start with a 1:

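# Start a new group at every 1, so each list element is one sequence
x <- split(original_data, cumsum(original_data == 1))
unique(x)
# For each distinct sequence, count how many groups are identical to it
res <- vapply(unique(x),
              function(x, y) sum(vapply(y, FUN = identical, y = x, FUN.VALUE = TRUE)),
              y = x, FUN.VALUE = 1L)
Res <- data.frame(n = res,
                  seq = vapply(unique(x), paste, collapse = ",", FUN.VALUE = "a"))
Res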
#   n                      seq
#1 21                  1,4,5,6
#2  4             1,2,4,5,6,10
#3  1 1,2,4,5,7,9,7,7,2,5,6,10
#4  1                  1,2,4,8
#5  1              1,2,3,5,7,8
#6  2                1,2,3,5,6
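
The same tally can probably be reached with `table()` by collapsing each sequence to a string key first. This is a sketch of that variation, not the answer's own code. The names `groups`, `keys`, `counts` and `complete` are made up for the example. The final filter assumes, as the asker suggests in the comments, that a complete sequence ends in 6 or 10.

```r
original_data <- c(rep(c(1, 4, 5, 6), 21),
                   1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10,
                   1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 6, 10, 1, 2, 4, 8,
                   1, 2, 3, 5, 7, 8, 1, 2, 3, 5, 6, 1, 2, 3, 5, 6,
                   1, 2, 4, 5, 6, 10)

# Split at every 1, so each list element is one sequence.
groups <- split(original_data, cumsum(original_data == 1))

# Collapse each sequence to a string key so that table() can count it.
keys <- vapply(groups, paste, character(1), collapse = ",")
counts <- sort(table(keys), decreasing = TRUE)
counts

# Assumption from the asker's comment: a complete sequence ends in 6 or 10.
last_value <- vapply(groups, function(g) g[length(g)], numeric(1))
complete <- last_value %in% c(6, 10)
table(keys[complete])
```

On the sample data, the sequences ending in 8 (`1,2,4,8` and `1,2,3,5,7,8`) are the ones dropped by the last step.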

edited Jan 25 '16 at 18:48

answered Jan 25 '16 at 13:55


or `sum(rollapply(original_data, length(pattern2), identical, pattern2))` – G. Grothendieck Jan 25 '16 at 14:07
@G.Grothendieck Yes, but I tend to be wary of using `identical`. It wouldn't work if `original_data` or `pattern2` was a named vector. – Roland Jan 25 '16 at 14:16
If that were the case pass it `unname(original_data)` or `unname(pattern2)` depending on which is named. – G. Grothendieck Jan 25 '16 at 14:19
Thank you! Do you have any ideas on how to do this without defining the patterns? I am trying to find an automatic way: instead of defining the patterns myself, I want a bit of code that extracts all the different cases that appear in the dataset. Maybe I should specify that every pattern starts with a 1 and ends in either 6 or 10. – dkera Jan 25 '16 at 16:58
I've added a way to do that. – Roland Jan 25 '16 at 18:13
Great! I think this is exactly what I needed – dkera Jan 25 '16 at 19:01
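
The caveat about `identical` raised in the comments can be checked with a short sketch. The vectors are made up for illustration. The last line shows a related pitfall that is not mentioned in the thread: `identical` also distinguishes integer from double, which may matter if the database column comes back as integer.

```r
pattern <- c(1, 4, 5, 6)
window  <- c(a = 1, b = 4, c = 5, d = 6)  # same values, but named

identical(window, pattern)           # FALSE: the names attribute differs
all(window == pattern)               # TRUE: only the values are compared
identical(unname(window), pattern)   # TRUE once the names are dropped

identical(c(1L, 4L, 5L, 6L), pattern)  # FALSE: integer versus double
```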

score 1 · Answer 2 · answered Jan 25 '16 at 13:47

This is definitely not the best way to do the job, but you could decide to treat your data as a string and then use regular expressions (via gregexpr).

original_data <- c(rep(c(1, 4, 5, 6), 21),  # 21 repetitions of 1, 4, 5, 6
                   1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10,
                   1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 6, 10, 1, 2, 4, 8,
                   1, 2, 3, 5, 7, 8, 1, 2, 3, 5, 6, 1, 2, 3, 5, 6,
                   1, 2, 4, 5, 6, 10)
data_as_string <- paste(original_data, collapse="-")
pattern1 = "1-2-4-5-6" # Your "pat"
pattern2 = "1-4-5-6" # Occurs 21 times in your data
pattern3 = "1-2-4-5-7-9-7-7-2-5-6-10" # Your "pat1"

gregexpr(pattern1,data_as_string)
# [[1]]
# [1] 169 207 220 273
# attr(,"match.length")
# [1] 9 9 9 9
# attr(,"useBytes")
# [1] TRUE

# So if you just want the number of occurrences
length(gregexpr(pattern1,data_as_string)[[1]])
# [1] 4
length(gregexpr(pattern2,data_as_string)[[1]])
# [1] 21
length(gregexpr(pattern3,data_as_string)[[1]])
# [1] 1
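
One caveat with `length(gregexpr(...)[[1]])`: when there is no match, `gregexpr` returns `-1`, so the length is 1 rather than 0. A plain substring search can also match across number boundaries, for example `1-4` inside `11-4`. The sketch below handles both. `count_matches` is a hypothetical helper, and it assumes the pattern contains only digits and hyphens. Matches are still non-overlapping, so a pattern that can overlap itself would be undercounted.

```r
# Hypothetical helper: count regex matches, returning 0 when there are none.
count_matches <- function(pattern, text) {
  # Lookarounds keep a match from starting or ending in the middle of a number
  rx <- paste0("(?<![0-9])", pattern, "(?![0-9])")
  m <- gregexpr(rx, text, perl = TRUE)[[1]]
  if (m[1] == -1L) 0L else length(m)
}

original_data <- c(rep(c(1, 4, 5, 6), 21),
                   1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10,
                   1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 6, 10, 1, 2, 4, 8,
                   1, 2, 3, 5, 7, 8, 1, 2, 3, 5, 6, 1, 2, 3, 5, 6,
                   1, 2, 4, 5, 6, 10)
data_as_string <- paste(original_data, collapse = "-")

count_matches("1-2-4-5-6", data_as_string)
# [1] 4
count_matches("1-4-5-6", data_as_string)
# [1] 21
count_matches("9-9-9", data_as_string)   # absent from the data
# [1] 0
length(gregexpr("9-9-9", data_as_string)[[1]])  # the naive count
# [1] 1
```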

Thank you! This works nicely. But my original question is that I don't want to specify the patterns manually. I want an algorithm that first detects the different types of patterns and then counts their occurrences. This means that I want to replace pattern1, pattern2, pattern3... with a generic bit of code. To help: a pattern starts with a 1 and ends in 6 or 10. – dkera, Jan 25 '16 at 17:05
