
I have this vector

data<-c(3,1,1,3,1,1,1,1,2,1,1,3,3,3,1,3,1,1,3,2,1,3,3,3,3)

I need to find the number of times I can have 1, then 2, then 3 (in this particular order)

So the expected answer for the above vector is 98 (counting all possible ways).

Is there an efficient way to do this? My actual problem involves a vector with many unique values (not simply 1, 2, 3).

Here is my code that gives the answer:

data <- c(3,1,1,3,1,1,1,1,2,1,1,3,3,3,1,3,1,1,3,2,1,3,3,3,3)
yind <- which(data == 2)   # positions of the middle value, 2
y1 <- yind[1]
y2 <- yind[2]
# for each 2: (count of 1s before it) * (count of 3s after it)
sum(data[1:y1] < data[y1]) * sum(data[y1:length(data)] > data[y1]) +
  sum(data[1:y2] < data[y2]) * sum(data[y2:length(data)] > data[y2])

but it is not suitable for a vector with many unique values. For example:

set.seed(3)
data2 <- sample(1:5,100,replace = TRUE)

and then count how many times I can have 1, then 2, then 3, then 4, then 5 (all possible ways).

Thank you

TcM
    How much more efficient than the solution you have already written? – Scott Hunter May 15 '20 at 15:54
  • [This](https://stackoverflow.com/questions/11095992/generating-all-distinct-permutations-of-a-list-in-r) might be of interest. – NelsonGon May 15 '20 at 15:54
  • Could you please describe in more detail what you need? The way I see it there is only one way of getting 1, 2, 3 *in this particular order*. – Jan May 15 '20 at 16:05
  • @Jan if you work from left to right and count all the possible ways you can obtain 1, 2, 3. As we have only two 2s here, it is easy to count: 6 ones × 9 threes + 11 ones × 4 threes = 98. But when you have more than 3 numbers it becomes difficult. – TcM May 15 '20 at 20:10
  • @Scott Hunter, by efficient means I can calculate it on my computer at least, for large numbers it is already a challenge. – TcM May 15 '20 at 20:10
  • Thanks @NelsonGon, but I cannot see how this link works. – TcM May 15 '20 at 20:11
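The hand count in the comments above (6 ones × 9 threes + 11 ones × 4 threes = 98) can be reproduced directly in R; this is just a check of that arithmetic, not part of any answer:

```r
data <- c(3,1,1,3,1,1,1,1,2,1,1,3,3,3,1,3,1,1,3,2,1,3,3,3,3)
twos <- which(data == 2)
# for each 2, count the 1s before it and the 3s after it
ones_before  <- sapply(twos, function(i) sum(data[seq_len(i - 1)] == 1))
threes_after <- sapply(twos, function(i) sum(data[(i + 1):length(data)] == 3))
sum(ones_before * threes_after)
#[1] 98
```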

2 Answers


Here is an option using non-equi joins from data.table:

library(data.table)
v <- data2
tofind <- 1L:5L
dat <- data.table(rn=seq_along(v), v)

# positions of the first value; each one starts a single path
paths <- dat[v==tofind[1L]][, npaths := as.double(1)]
for (k in tofind[-1L]) {
    # non-equi inner join: for each position of k, sum the paths ending before it
    paths <- paths[dat[v==k], on=.(rn<rn), allow.cartesian=TRUE, nomatch=0L,
        by=.EACHI, .(npaths=sum(npaths))]
}
paths[, sum(npaths)]

Output for your data is 98. Output for your data2 is 20873.

Explanation: Picture an n-ary tree where each layer corresponds to one number in the sequence you are looking for, and each vertex is a position of that number in the data vector. For example, for data = c(1,2,1,2,3) the tree would look like

[Tree diagram: one layer per target value, vertices labelled with the positions of that value in the data vector]

So the code goes through each layer and finds the number of paths going into each vertex on that layer, using a non-equi inner join to identify the paths entering each vertex.
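The layer-by-layer counting described above can also be sketched in base R with a single left-to-right pass (a minimal illustration of the same idea, not the answer's data.table code; the function name `count_pattern` is mine):

```r
# Count ordered subsequences of `v` matching `pattern`.
# counts[k] = number of ways seen so far to build the first k pattern values.
count_pattern <- function(v, pattern) {
  counts <- numeric(length(pattern))
  for (x in v) {
    # update deeper layers first so an element never feeds its own update
    for (k in rev(seq_along(pattern))) {
      if (x == pattern[k]) {
        counts[k] <- counts[k] + if (k == 1) 1 else counts[k - 1]
      }
    }
  }
  counts[length(pattern)]
}

data <- c(3,1,1,3,1,1,1,1,2,1,1,3,3,3,1,3,1,1,3,2,1,3,3,3,3)
count_pattern(data, 1:3)
#[1] 98
```

This runs in O(length(v) × length(pattern)) time and constant extra memory, so it avoids the overflow and memory issues discussed in the comments (use `numeric` counts, as here, so large totals don't overflow integers).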

chinsoon12
  • Thank you very much, I am still trying to understand the codes. When I increase the sample size to 1000, I got an error. – TcM May 15 '20 at 23:43
  • Yes. prob have to break it down somehow. I need simply the proportion, meaning need to divide this number over all possible combinations (so not in that specific order 1<2<... – TcM May 16 '20 at 00:21
  • Thank you, it is super fast. But I am not sure if I am doing something wrong, when k=7 or above (not 5 as above in your example) and for sample size 1000. I got this error below. Warning message: In gsum(npaths) : The sum of an integer column for a group was more than type 'integer' can hold so the result has been coerced to 'numeric' automatically for convenience. – TcM May 16 '20 at 01:31
  • wah u are dealing with really large numbers. its integer overflow. let me update. 193478053782? – chinsoon12 May 16 '20 at 01:38
  • Wow, it seems to work, thank you very much. The next step for me is to understand the codes. Thanks a lot. – TcM May 16 '20 at 11:01
  • @TcM i refactor to use one `[`. should be faster and less memory intensive – chinsoon12 May 17 '20 at 10:13
  • Thank you very much @chinsoon12, much appreciated. – TcM May 18 '20 at 14:21

Here's an approach with expand.grid.

FindComb <- function(vector, variables){
  # one column per target value, holding every position of that value
  grid <- do.call(expand.grid, lapply(variables, function(x) which(vector == x)))
  # keep only the rows whose positions strictly increase left to right
  sum(Reduce(`&`, lapply(seq(2, ncol(grid)), function(x) grid[,x-1] < grid[,x])))
}
FindComb(data,c(1,2,3))
#[1] 98

I expect it will not scale well with longer vectors or more numbers, but it works OK for smaller scales:

set.seed(3)
data2 <- sample(1:9,1000,replace = TRUE)
FindComb(data2,c(8,2,3))
#[1] 220139
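`expand.grid` allocates one row per combination, so the grid has as many rows as the product of the per-value counts; that product is what exhausts memory for larger inputs. A quick way to estimate the allocation before calling `FindComb` (the helper name `grid_rows` is mine, not part of the answer):

```r
# Rows expand.grid would create: the product of the per-value counts
grid_rows <- function(vector, variables) {
  prod(sapply(variables, function(x) sum(vector == x)))
}

data <- c(3,1,1,3,1,1,1,1,2,1,1,3,3,3,1,3,1,1,3,2,1,3,3,3,3)
grid_rows(data, c(1,2,3))
#[1] 264
```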
Ian Campbell
  • Yes, for 5 numbers I got an error: `set.seed(3); data2 <- sample(1:5,1000,replace = TRUE); FindComb(data2,c(1,2,3,4,5))` gives `Error: vector memory exhausted (limit reached?)` – TcM May 15 '20 at 21:29
  • It's possible to improve this approach by filtering the vectors before calling `expand.grid`. But since you can't tell us what your goal is, it's probably not worth the effort. – Ian Campbell May 15 '20 at 21:51
  • Thank you very much for your answer. My goal is simply to find this number "set.seed(3) data2 <- sample(1:k,1000,replace = TRUE) FindComb(data2,1:k)" but for large k, k=9 or 10 would work. And then repeat this process for many data sets (different vector than above but takes values from 1 to k). So your example above is perfect but instead for c(8,2,3) I would need 1:9. Thanks – TcM May 15 '20 at 22:10