example:

s <- "aaabaabaa"
p <- "aa"

I want to return 4, not 3 (i.e. counting the number of "aa" instances in the initial "aaa" as 2, not 1).

Is there a package that does this, or some other way to count overlapping matches in R?

user2864740
frashman
  • I think the OP wants to count the number of occurrences of the string `"aa"` in `s`, counting the two overlapping occurrences in `"aaa"`. There might be something useful in the genetics/Bioconductor tools. – Ben Bolker May 24 '14 at 02:11
  • https://stat.ethz.ch/pipermail/r-help/2009-December/222521.html – Ben Bolker May 24 '14 at 02:13
  • `sum(grepl(p, sapply(1:(nchar(s) - 1), function(ii) substr(s, ii, ii + 1))))` – rawr May 24 '14 at 02:14

4 Answers

I believe that

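find_overlaps <- function(p,s) {
    ## a zero-width lookahead matches at every start position of p,
    ## so overlapping occurrences are each counted
    gg <- gregexpr(paste0("(?=",p,")"),s,perl=TRUE)[[1]]
    ## gregexpr() returns -1 when there is no match at all
    if (length(gg)==1 && gg==-1) 0 else length(gg)
}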


find_overlaps("aa","aaabaabaa")  ## 4
find_overlaps("not_there","aaabaabaa") ## 0 
find_overlaps("aa","aaaaaaaa")  ## 7

will do what you want, which would be more clearly expressed as "finding the number of overlapping substrings within a string".

This is a minor variation on "Finding the indexes of multiple/overlapping matching substrings".
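One caveat with the lookahead approach is that `p` is interpreted as a regular expression, so a pattern such as `"."` matches any character. The following is a minimal sketch, not part of the original answer, that treats `p` as a literal string by wrapping it in PCRE's `\Q...\E` quoting. The name `find_overlaps_literal` is hypothetical, and the sketch assumes `p` does not itself contain `\E`.

```r
# Hypothetical helper: count overlapping occurrences of the literal string p in s.
# Assumes p does not contain the sequence "\E", which would end the quoting early.
find_overlaps_literal <- function(p, s) {
    # \Q...\E makes PCRE treat everything in between as literal text
    gg <- gregexpr(paste0("(?=\\Q", p, "\\E)"), s, perl = TRUE)[[1]]
    # gregexpr() returns -1 when there is no match
    if (gg[1] == -1) 0L else length(gg)
}

find_overlaps_literal("aa", "aaabaabaa")  ## 4
find_overlaps_literal(".", "a.b.c")       ## 2 (the unquoted pattern "." would give 5)
```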

Ben Bolker
  • I didn't see that you also gave the solution. Mine works too but is a little clunkier (but perhaps more transparent). – Ben Bolker May 24 '14 at 02:28
  • I needed a more general approach and have posted an answer that uses your solution. My approach is surely not ideal. If you generalize your approach within your own post I will delete my answer. – Mark Miller May 08 '18 at 00:37

substring might be useful here, by taking every successive pair of characters.

( ss <- sapply(2:nchar(s), function(i) substring(s, i-1, i)) )
## [1] "aa" "aa" "ab" "ba" "aa" "ab" "ba" "aa"
sum(ss %in% p)
## [1] 4
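The code above takes windows of exactly two characters, so it only works for patterns of length 2. As a rough sketch of how the same idea might extend to a pattern of any length, the window width can be taken from `nchar(p)`. The function name `count_windows` is an assumption for illustration and does not come from the original answer.

```r
# Hypothetical generalization: slide a window of width nchar(p) over s
# and count the windows that are exactly equal to p.
count_windows <- function(p, s) {
    k <- nchar(p)
    n <- nchar(s)
    # an empty pattern, or one longer than the string, has no windows to compare
    if (k == 0 || k > n) return(0L)
    starts <- seq_len(n - k + 1)
    # substring() is vectorized over the start and stop positions
    sum(substring(s, starts, starts + k - 1) == p)
}

count_windows("aa", "aaabaabaa")   ## 4
count_windows("aab", "aaabaabaa")  ## 2
```

Because the comparison uses `==` rather than a regular expression, `p` is always treated as a literal string here.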
Rich Scriven

I needed an answer to a related, more general question: counting overlapping occurrences of several patterns in each of several strings. Here is what I came up with, generalizing Ben Bolker's solution:

my.data <- read.table(text = '
  my.string   my.cov
     1.2...        1
     .21111        2
     ..2122        3
     ...211        2
     112111        4
     212222        1
', header = TRUE, stringsAsFactors = FALSE)

desired.result.2ch <- read.table(text = '
  my.string   my.cov   n.11   n.12   n.21   n.22
     1.2...        1      0      0      0      0
     .21111        2      3      0      1      0
     ..2122        3      0      1      1      1
     ...211        2      1      0      1      0
     112111        4      3      1      1      0
     212222        1      0      1      1      3
', header = TRUE, stringsAsFactors = FALSE)

desired.result.3ch <- read.table(text = '
  my.string   my.cov   n.111   n.112   n.121   n.122   n.222   n.221   n.212   n.211
     1.2...        1       0       0       0       0       0       0       0       0
     .21111        2       2       0       0       0       0       0       0       1
     ..2122        3       0       0       0       1       0       0       1       0
     ...211        2       0       0       0       0       0       0       0       1
     112111        4       1       1       1       0       0       0       0       1
     212222        1       0       0       0       1       2       0       1       0
', header = TRUE, stringsAsFactors = FALSE)

find_overlaps <- function(s, my.cov, p) {
    gg <- gregexpr(paste0("(?=",p,")"),s,perl=TRUE)[[1]]
    if (length(gg)==1 && gg==-1) 0 else length(gg)
}

p <- c('11', '12', '21', '22', '111', '112', '121', '122', '222', '221', '212', '211')

my.output <- matrix(0, ncol = (nrow(my.data)+1), nrow = length(p))

for(i in seq_along(p)) {
    # attach the current pattern to every row so apply() can pass it along
    my.data$p <- p[i]
    my.output[i,1] <- p[i]
    # count overlapping occurrences of p[i] in each row's my.string
    my.output[i,(2:(nrow(my.data)+1))] <- apply(my.data, 1, function(x) find_overlaps(x[1],  x[2],  x[3]))
}

my.output
desired.result.2ch
desired.result.3ch

pre.final.output <- matrix(t(my.output[,2:7]), ncol=length(p), nrow=nrow(my.data))

final.output <- data.frame(my.data[,1:2], t(apply(pre.final.output, 1, as.numeric)))
colnames(final.output) <- c(colnames(my.data[,1:2]), paste0('x', p))
final.output

#  my.string my.cov x11 x12 x21 x22 x111 x112 x121 x122 x222 x221 x212 x211
#1    1.2...      1   0   0   0   0    0    0    0    0    0    0    0    0
#2    .21111      2   3   0   1   0    2    0    0    0    0    0    0    1
#3    ..2122      3   0   1   1   1    0    0    0    1    0    0    1    0
#4    ...211      2   1   0   1   0    0    0    0    0    0    0    0    1
#5    112111      4   3   1   1   0    1    1    1    0    0    0    0    1
#6    212222      1   0   1   1   3    0    0    0    1    2    0    1    0
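The same table of counts can probably be built more compactly by letting `sapply()` loop over the patterns. This is only a sketch under assumptions: `count_overlaps`, `strings`, `patterns`, and `counts` are hypothetical names, the strings are copied from `my.data` above, and only the two-character patterns are shown.

```r
# Hypothetical vectorized variant of the lookahead approach:
# returns one count per element of the character vector s.
count_overlaps <- function(p, s) {
    vapply(s, function(one) {
        gg <- gregexpr(paste0("(?=", p, ")"), one, perl = TRUE)[[1]]
        # gregexpr() returns -1 when there is no match
        if (gg[1] == -1) 0L else length(gg)
    }, integer(1), USE.NAMES = FALSE)
}

strings  <- c("1.2...", ".21111", "..2122", "...211", "112111", "212222")
patterns <- c("11", "12", "21", "22")

# one row per string, one column per pattern (columns are named after the patterns)
counts <- sapply(patterns, count_overlaps, s = strings)
counts
```

The resulting integer matrix can then be combined with the original data frame using `cbind()`, which avoids the character-to-numeric round trip in the loop-based version.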
Mark Miller

A tidy and, I think, more readable solution is:

library(tidyverse)
PatternCount <- function(text, pattern) {
    #Generate all sliding substrings
    #(max(0, ...) avoids an error from seq_len() when the pattern is longer than the text)
    map(seq_len(max(0, nchar(text) - nchar(pattern) + 1)), 
        function(x) str_sub(text, x, x + nchar(pattern) - 1)) %>%
    #Test them against the pattern
    map_lgl(function(x) x == pattern) %>%
    #Count the number of matches
    sum
}

PatternCount("aaabaabaa", "aa")
# 4
pgcudahy