Finding repeated substrings with R

Question

I have the following code for finding out a pattern (consecutively repeated substring) in a string, say 0110110110000. The output patterns are 011 and 110, since they are both repeated within the string. What changes can be done to the following code?

I'd like to identify substrings that start from any position in a given string, and which repeat for at least a threshold number of times. In the above mentioned string, the threshold is three (th = 3). The repeated string should be the maximal repeated string. In the above string, 110 and 011 both satisfy these conditions.

Here's my attempt at doing this:

reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times

find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
for(k in len:1) {
    pat <- paste0("(.{", k, "})", reps("\\1", th-1))
    r <- regexpr(pat, string, perl = TRUE)
    if (attr(r, "capture.length") > 0) break
}
if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}

to find a pattern from any position in any given string such that the pattern repeats for a threshold number of times at least. In the above mentioned string threshold number(th = 3) also repeated string should be maximal repeated string. In the above string, 110 and 011 both are satisfying above mentioned conditions. — user3742375, Jun 15 '14 at 14:13
You should make all of that clear in the question itself, so that people don't have to read through comments to discover essential details. — jbaums, Jun 15 '14 at 14:15
What about `0000` in your example string? (That's four zeroes in a row.) — jbaums, Jun 15 '14 at 14:45
Similar question [here](http://stackoverflow.com/q/9216783/489704), although that post refers to substrings that repeat but aren't necessarily adjacent. It's also perl (using constructs that I believe aren't supported in R). — jbaums, Jun 15 '14 at 15:57

score 4 · Answer 1 · answered Jun 15 '14 at 15:55

You can do this with regex:

s <- '0110110110000'

thr <- 3

m <- gregexpr(sprintf('(?=(.+)(?:\\1){%s,})', thr-1), s, perl=TRUE)

unique(mapply(function(x, y) substr(s, x, x+y-1), 
              attr(m[[1]], 'capture.start'), 
              attr(m[[1]], 'capture.length')))

# [1] "011" "110" "0"

The pattern in the gregexpr uses a positive lookahead to prevent characters from being consumed by the match (and so allowing overlapping matches, such as with the 011 and 110). We use a repeated (at least thr - 1 times) backreference to the captured group to look for repeated substrings.

Then we can extract the matched substrings by taking start positions and lengths from the attributes of the result of gregexpr, i.e. the object m.

You didn't specify a minimum string length, so this returns 0 as one of the repeated substrings. If you have a minimum and/or maximum substring length in mind, you can modify the first subexpression of the regex. For example, the following would match only substrings with at least 2 characters.

sprintf('(?=(.{2,})(?:\\1){%s,})', thr-1)

Finding repeated substrings with R

1 Answers1

Linked