I would recommend using stringi::stri_count_fixed
as follows:
> library(stringi)
> seqs <- data.frame(sequence=c('ACCTACGT', 'CCCCCCCC', 'ACCCGCCT'))
> opts <- stri_opts_fixed(overlap=TRUE)
> seqs$true_count <- stri_count_fixed(str=seqs$sequence, pattern='CC', opts_fixed=opts)
> seqs
sequence true_count
1 ACCTACGT 1
2 CCCCCCCC 7
3 ACCCGCCT 3
With fixed pattern stringi
is an order of magnitude faster than using gregexpr
:
library(microbenchmark)
# Answer provided by @user20650 in the comments
f1 <- function(x) sapply(gregexpr('(?=CC)', x, perl=T) , function(i) sum(i>0))
f2 <- function(x) stri_count_fixed(
str=x, pattern='CC',
opts_fixed=stri_opts_fixed(overlap=TRUE))
# Generate random sequences
sequence <- stri_rand_strings(n=10000, length=1000, pattern='[ATGC]')
Microbenchmark results:
> microbenchmark(f1(sequence), f2(sequence))
Unit: milliseconds
expr min lq mean median uq max neval
f1(sequence) 290.90393 304.87107 329.11392 313.39819 327.9860 437.10229 100
f2(sequence) 20.99733 21.12559 21.39206 21.26017 21.4377 27.68867 100
You may also take a look at the Biostrings library. From my experience it is usually slower than working with stringi
and requires some additional steps but provides many useful functions designed to work with biological sequences, including countPattern
:
library(Biostrings)
bsequence <- DNAStringSet(sequence)
f3 <- function(x) vcountPattern('CC', x)
Microbenchmark results:
> microbenchmark(f2(sequence), f3(bsequence))
Unit: milliseconds
expr min lq mean median uq max neval
f2(sequence) 20.83336 21.11473 21.36759 21.25088 21.45000 23.80708 100
f3(bsequence) 86.95430 89.10023 89.51665 89.37103 89.87699 91.88203 100
And just to be sure:
> identical(f1(seqs$sequence), f2(seqs$sequence), f3(BStringSet(seqs$sequence))
[1] TRUE