Count the number of overlapping substrings within a string

Question

example:

s <- "aaabaabaa"
p <- "aa"

I want to return 4, not 3 (i.e. counting the number of "aa" instances in the initial "aaa" as 2, not 1).

Is there a package that does this, or another way to count overlapping matches in R?

I think the OP wants to count the number of occurrences of the string `"aa"` in `s`, counting the two overlapping occurrences in `"aaa"`. There might be something useful in the genetics/Bioconductor tools. — Ben Bolker, May 24 '14 at 02:11
https://stat.ethz.ch/pipermail/r-help/2009-December/222521.html — Ben Bolker, May 24 '14 at 02:13
`sum(grepl(p, sapply(1:(nchar(s) - 1), function(ii) substr(s, ii, ii + 1)))) ` — rawr, May 24 '14 at 02:14

score 8 · Answer 1 · edited May 23 '17 at 12:11

I believe that

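find_overlaps <- function(p, s) {
    ## (?=p) is a zero-width lookahead: a match consumes no characters,
    ## so the search resumes one position later and overlaps are found
    gg <- gregexpr(paste0("(?=", p, ")"), s, perl = TRUE)[[1]]
    ## gregexpr() returns -1 when there is no match at all
    if (length(gg) == 1 && gg == -1) 0 else length(gg)
}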


find_overlaps("aa","aaabaabaa")  ## 4
find_overlaps("not_there","aaabaabaa") ## 0 
find_overlaps("aa","aaaaaaaa")  ## 7

will do what you want, which would be more clearly expressed as "finding the number of overlapping substrings within a string".

This is a minor variation on "Finding the indexes of multiple/overlapping matching substrings".
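
One caveat with the lookahead approach: `p` is pasted into a regular expression, so metacharacters such as `.` are interpreted as regex syntax rather than literally. Below is a minimal sketch of a literal variant. It assumes PCRE's `\Q...\E` quoting and assumes `p` does not itself contain `\E`; the name `find_overlaps_literal` is made up for illustration and is not part of the original answer.

```r
## Hypothetical variant: treat p as a literal string, not a regex.
## \Q...\E quotes everything in between (PCRE, hence perl = TRUE).
find_overlaps_literal <- function(p, s) {
    gg <- gregexpr(paste0("(?=\\Q", p, "\\E)"), s, perl = TRUE)[[1]]
    if (length(gg) == 1 && gg == -1) 0 else length(gg)
}

find_overlaps_literal("aa", "aaabaabaa")  ## 4
find_overlaps_literal("a.", "aaabaabaa")  ## 0, no literal "a." in the string

## For comparison, the unquoted pattern treats "." as "any character":
length(gregexpr("(?=a.)", "aaabaabaa", perl = TRUE)[[1]])  ## 6
```

If the patterns are known to be plain letters or digits, as in the question, the quoting makes no difference.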

answered May 24 '14 at 02:19

Ben Bolker

I didn't see that you also gave the solution. Mine works too but is a little clunkier (but perhaps more transparent). – Ben Bolker May 24 '14 at 02:28
I needed a more general approach and have posted an answer that uses your solution. My approach is surely not ideal. If you generalize your approach within your own post I will delete my answer. – Mark Miller May 08 '18 at 00:37

Rich Scriven · Answer 2 · 2018-05-08T00:44:02.323

3

`substring` might be useful here: take every successive pair of characters, then count how many of them equal `p`.

( ss <- sapply(2:nchar(s), function(i) substring(s, i-1, i)) )
## [1] "aa" "aa" "ab" "ba" "aa" "ab" "ba" "aa"
sum(ss %in% p)
## [1] 4
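
The pair-wise idea above is tied to a two-character pattern. As a rough sketch, it could be extended to a pattern of any length by sliding a window of width `nchar(p)` over the string. The function name `count_windows` is hypothetical, and the comparison is literal (no regex).

```r
## Hypothetical helper: count overlapping literal occurrences of p in s
## by comparing every window of width nchar(p) against p.
count_windows <- function(s, p) {
    n <- nchar(s)
    k <- nchar(p)
    ## guard: an empty pattern or one longer than the string has no windows
    if (k == 0 || k > n) return(0L)
    starts <- seq_len(n - k + 1)
    ## substring() is vectorised over the start and stop positions
    windows <- substring(s, starts, starts + k - 1)
    sum(windows == p)
}

count_windows("aaabaabaa", "aa")   ## 4
count_windows("aaaaaaaa", "aaa")   ## 6
```

Because `substring()` is vectorised, no explicit `sapply()` loop is needed here.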

answered May 24 '14 at 02:27

that would be fine in any case. @rawr could have posted the comment as an answer if they wanted. – Ben Bolker May 24 '14 at 02:34

score 1 · Answer 3 · answered May 08 '18 at 00:36

I needed the answer to a related, more general question. Here is what I came up with, generalizing Ben Bolker's solution:

my.data <- read.table(text = '
  my.string   my.cov
     1.2...        1
     .21111        2
     ..2122        3
     ...211        2
     112111        4
     212222        1
', header = TRUE, stringsAsFactors = FALSE)

desired.result.2ch <- read.table(text = '
  my.string   my.cov   n.11   n.12   n.21   n.22
     1.2...        1      0      0      0      0
     .21111        2      3      0      1      0
     ..2122        3      0      1      1      1
     ...211        2      1      0      1      0
     112111        4      3      1      1      0
     212222        1      0      1      1      3
', header = TRUE, stringsAsFactors = FALSE)

desired.result.3ch <- read.table(text = '
  my.string   my.cov   n.111   n.112   n.121   n.122   n.222   n.221   n.212   n.211
     1.2...        1       0       0       0       0       0       0       0       0
     .21111        2       2       0       0       0       0       0       0       1
     ..2122        3       0       0       0       1       0       0       1       0
     ...211        2       0       0       0       0       0       0       0       1
     112111        4       1       1       1       0       0       0       0       1
     212222        1       0       0       0       1       2       0       1       0
', header = TRUE, stringsAsFactors = FALSE)

find_overlaps <- function(s, my.cov, p) {
    gg <- gregexpr(paste0("(?=",p,")"),s,perl=TRUE)[[1]]
    if (length(gg)==1 && gg==-1) 0 else length(gg)
}

p <- c('11', '12', '21', '22', '111', '112', '121', '122', '222', '221', '212', '211')

my.output <- matrix(0, ncol = (nrow(my.data)+1), nrow = length(p))

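for(i in seq_along(p)) {
    ## add the current pattern as a third column so apply() can pass it along
    my.data$p <- p[i]
    ## storing the pattern here coerces the whole matrix to character
    my.output[i,1] <- p[i]
    ## apply() hands over each row as a character vector: string, covariate, pattern
    my.output[i,(2:(nrow(my.data)+1))] <- apply(my.data, 1, function(x) find_overlaps(x[1], x[2], x[3]))
}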

my.output
desired.result.2ch
desired.result.3ch

pre.final.output <- matrix(t(my.output[,2:7]), ncol=length(p), nrow=nrow(my.data))

final.output <- data.frame(my.data[,1:2], t(apply(pre.final.output, 1, as.numeric)))
colnames(final.output) <- c(colnames(my.data[,1:2]), paste0('x', p))
final.output

#  my.string my.cov x11 x12 x21 x22 x111 x112 x121 x122 x222 x221 x212 x211
#1    1.2...      1   0   0   0   0    0    0    0    0    0    0    0    0
#2    .21111      2   3   0   1   0    2    0    0    0    0    0    0    1
#3    ..2122      3   0   1   1   1    0    0    0    1    0    0    1    0
#4    ...211      2   1   0   1   0    0    0    0    0    0    0    0    1
#5    112111      4   3   1   1   0    1    1    1    0    0    0    0    1
#6    212222      1   0   1   1   3    0    0    0    1    2    0    1    0
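
For what it is worth, the same table of counts can probably be built more compactly by looping over the patterns with `sapply()`. The sketch below is only an illustration under a few assumptions: the strings are re-typed by hand, only the two-character patterns are shown, and `count_overlaps` is a hypothetical rename of the lookahead function with the unused covariate argument dropped.

```r
## Hypothetical helper: the lookahead counter without the unused argument.
count_overlaps <- function(s, p) {
    gg <- gregexpr(paste0("(?=", p, ")"), s, perl = TRUE)[[1]]
    if (length(gg) == 1 && gg == -1) 0L else length(gg)
}

my.strings <- c("1.2...", ".21111", "..2122", "...211", "112111", "212222")
patterns <- c("11", "12", "21", "22")

## One row per string, one column per pattern.
counts <- sapply(patterns, function(p)
    vapply(my.strings, count_overlaps, integer(1), p = p, USE.NAMES = FALSE))
colnames(counts) <- paste0("n.", patterns)
counts
```

The resulting matrix should line up with the `n.11` to `n.22` columns of `desired.result.2ch`, and it can be joined to the data with `cbind()` if a data frame is wanted.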

pgcudahy · Answer 4 · 2019-05-30T09:40:15.853

A tidy and, I think, more readable solution is:

library(tidyverse)
PatternCount <- function(text, pattern) {
    #Generate all sliding substrings
    map(seq_len(nchar(text) - nchar(pattern) + 1), 
        function(x) str_sub(text, x, x + nchar(pattern) - 1)) %>%
    #Test them against the pattern
    map_lgl(function(x) x == pattern) %>%
    #Count the number of matches
    sum
}

PatternCount("aaabaabaa", "aa")
# 4
