
I am using R and have a dataframe containing strings built from 4 unique letters (DNA). I want to count how many times certain combinations of letters occur in these strings. One of the scenarios is detecting how many times the same letter appears back to back.

I have come across several possible ways to achieve this using regex and packages like stringr, but I still have one problem.

These methods do not count overlapping matches: they do not step through the string letter by letter and treat the next letter as the start of a new match. This is a problem when the same letter is repeated more than twice.

Example (where I want to count how many times "CC" occurs; the true_count column is my desired output):

sequence   stringr_count   true_count
ACCTACGT               1            1
CCCCCCCC               4            7
ACCCGCCT               2            3
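
For reference, here is a minimal reproduction of the non-overlapping behaviour, assuming the stringr_count column above comes from stringr::str_count (an assumption on my part):

library(stringr)
x <- c("ACCTACGT", "CCCCCCCC", "ACCCGCCT")
# str_count() consumes the characters it matches, so overlapping runs of "CC" are undercounted
str_count(x, "CC")
# [1] 1 4 2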
  • this should help http://stackoverflow.com/questions/7878992/finding-the-indexes-of-multiple-overlapping-matching-substrings?answertab=votes#tab-top `lengths(gregexpr("(?=CC)", x, perl=T) )` – user20650 Jun 19 '15 at 19:10
  • Using `lengths(gregexpr("(?=CC)", x, perl=T))` on strings where there are no occurrences of a particular combination returns 1. Why would this be the case and what would be a good workaround? e.g. looking for the substring "AA" in my example above (see the sketch after these comments). – user1607359 Jun 19 '15 at 19:35
  • Yup, sorry, the previous comment is incorrect: try `sapply(gregexpr("(?=CC)", x, perl=T) , function(i) length(i[i>0]))` – user20650 Jun 19 '15 at 19:37
  • That did the trick. Thanks for the help! – user1607359 Jun 19 '15 at 19:42
  • You're welcome ... note that zero323's answer seems to pull out the right counts also – user20650 Jun 19 '15 at 19:43
  • perhaps this is better `sapply(gregexpr("(?=AA)", x, perl=T) , function(i) sum(i>0))` – user20650 Jun 19 '15 at 19:54
  • While using `gregexpr` works it is just embarrassingly slow. Please check updated version of my answer. – zero323 Jun 20 '15 at 12:17
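
A quick sketch (not from the original thread) of the behaviour discussed in the comments: gregexpr() returns -1 when there is no match, so lengths() still reports 1; counting only positive match positions gives the right answer.

x <- c("ACCTACGT", "CCCCCCCC", "ACCCGCCT")
# gregexpr() yields a length-1 vector containing -1 when nothing matches
gregexpr("(?=AA)", x, perl = TRUE)[[1]][1]
# [1] -1
lengths(gregexpr("(?=AA)", x, perl = TRUE))
# [1] 1 1 1
# counting only positive positions returns 0 for strings without the pattern
sapply(gregexpr("(?=AA)", x, perl = TRUE), function(i) sum(i > 0))
# [1] 0 0 0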

1 Answer


I would recommend using stringi::stri_count_fixed as follows:

> library(stringi)
> seqs <- data.frame(sequence=c('ACCTACGT', 'CCCCCCCC', 'ACCCGCCT'), stringsAsFactors=FALSE)
> opts <- stri_opts_fixed(overlap=TRUE)
> seqs$true_count <- stri_count_fixed(str=seqs$sequence, pattern='CC', opts_fixed=opts)
> seqs
  sequence true_count
1 ACCTACGT          1
2 CCCCCCCC          7
3 ACCCGCCT          3
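
As a side note, and based on my reading of the stringi documentation (so treat this as an assumption rather than something verified here), the engine options can also be passed through ..., which makes the separate stri_opts_fixed() call optional:

# Assumes stri_count_fixed() forwards ... to stri_opts_fixed(); should reproduce true_count above
stri_count_fixed(seqs$sequence, 'CC', overlap=TRUE)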

With a fixed pattern, stringi is an order of magnitude faster than using gregexpr:

library(microbenchmark)

# Answer provided by @user20650 in the comments
f1 <- function(x) sapply(gregexpr('(?=CC)', x, perl=T) , function(i) sum(i>0))

f2 <- function(x) stri_count_fixed(
    str=x, pattern='CC',
    opts_fixed=stri_opts_fixed(overlap=TRUE))

# Generate random sequences
sequence <- stri_rand_strings(n=10000, length=1000, pattern='[ATGC]')

Microbenchmark results:

> microbenchmark(f1(sequence), f2(sequence))
Unit: milliseconds
         expr       min        lq      mean    median       uq       max neval
 f1(sequence) 290.90393 304.87107 329.11392 313.39819 327.9860 437.10229   100
 f2(sequence)  20.99733  21.12559  21.39206  21.26017  21.4377  27.68867   100

You may also take a look at the Biostrings package. In my experience it is usually slower than working with stringi and requires some additional steps, but it provides many useful functions designed to work with biological sequences, including countPattern:

library(Biostrings)

bsequence <- DNAStringSet(sequence)
f3 <- function(x) vcountPattern('CC', x)

Microbenchmark results:

> microbenchmark(f2(sequence), f3(bsequence))
Unit: milliseconds
          expr      min       lq     mean   median       uq      max neval
  f2(sequence) 20.83336 21.11473 21.36759 21.25088 21.45000 23.80708   100
 f3(bsequence) 86.95430 89.10023 89.51665 89.37103 89.87699 91.88203   100

And just to be sure the approaches agree (identical compares two objects at a time, so the checks are chained):

> identical(f1(seqs$sequence), f2(seqs$sequence)) &&
+   identical(f2(seqs$sequence), f3(DNAStringSet(seqs$sequence)))
[1] TRUE
zero323