10

I want to find the pattern from any position in any given string such that the pattern repeats for a threshold number of times at least. For example for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out to be "aaaab". Another example: for the string "ff00f0f0f0f0f0f0f0f0000" the pattern should be "0f". In both cases threshold has been taken as 3 i.e. the pattern should be repeated for at least 3 times.

If someone can suggest an optimized method in R for finding a solution to this problem, please do share with me. Currently I am achieving this by using 3 nested loops, and it's taking a lot of time.

Thanks!

phoenix
  • 335
  • 1
  • 4
  • 19
  • In your first example, the pattern `ab` also qualifies. Are you looking for the *maximal* repeated string? – Carl Witthoft Jan 09 '14 at 12:26
  • @carl: Yeah! I am looking for the maximal repeated string. – phoenix Jan 09 '14 at 12:28
  • So in the first example `"a"` would be the answer? I count 14 instances of `"a"`. – Simon O'Hanlon Jan 09 '14 at 12:30
  • No "a" would not be an answer for the 1st example. Also the repetitions should be consecutive – phoenix Jan 09 '14 at 12:31
  • @phoenix ok, why? `"a"` repeats 4 times. And then why also `"aaab"` and not `"aaaab"`? – Simon O'Hanlon Jan 09 '14 at 12:31
  • My mistake! It's "aaaab". And I have added my attempted code. – phoenix Jan 09 '14 at 12:39
  • Related: http://stackoverflow.com/questions/10355103/finding-the-longest-repeated-substring and http://stackoverflow.com/questions/9614326/longest-repeated-substring-better-complexity – Andrie Jan 09 '14 at 12:43
  • 2
    This is very confusing. As @SimonO'Hanlon says, the pattern `"a"` is repeated **four times**, and occurs **before** the `"aaaab"` is thrice repeated. It reaches your threshold of being repeated (more than) three times consecutively, so why doesn't `"aaaa"` qualify. Apparently you also have an unstated requirement that the repeated element must be longer than one character... – Josh O'Brien Jan 09 '14 at 13:13
  • @Josh: I am looking for the MAXIMAL length repeated sub-string that repeats itself consecutively at least threshold number of times. – phoenix Jan 09 '14 at 13:17
  • @phoenix ok, that makes sense. You want the longest string that repeats, in this case 3 times. You don't actually care about the number of repeats, just the length of the string and whether it repeats at least 3 times? – Simon O'Hanlon Jan 09 '14 at 13:28
  • It occurs to me that a minor modification of either Lev-Zimpel (zip) or Huffman compression algorithms should provide you with the answer you want. Since there are some whizbang fast :-) implementations of those, it might be worth investigating that approach. – Carl Witthoft Jan 09 '14 at 13:29
  • @phoenix can matches be overlapping? e.g. in `"aaaaaa"` is the maximal length string that repeats 3 times `"aa"` or `"aaaa"` (start at position one, match the first four characters, then shift a long the vector one character and the next four characters also match etc)? – Simon O'Hanlon Jan 09 '14 at 13:34

6 Answers6

11

Use regular expressions, which are made for this type of stuff. There may be more optimized ways of doing it, but in terms of easy to write code, it's hard to beat. The data:

vec <- c("a0cc0vaaaabaaaabaaaabaa00bvw","ff00f0f0f0f0f0f0f0f0000")    

The function that does the matching:

find_rep_path <- function(vec, reps) {
  regexp <- paste0(c("(.+)", rep("\\1", reps - 1L)), collapse="")
  match <- regmatches(vec, regexpr(regexp, vec, perl=T))
  substr(match, 1, nchar(match) / reps)  
}

And some tests:

sapply(vec, find_rep_path, reps=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw      ff00f0f0f0f0f0f0f0f0000 
#                      "aaaab"                       "0f0f" 
sapply(vec, find_rep_path, reps=5L)
# $a0cc0vaaaabaaaabaaaabaa00bvw
# character(0)
# 
# $ff00f0f0f0f0f0f0f0f0000
# [1] "0f"

Note that with threshold as 3, the actual longest pattern for the second string is 0f0f, not 0f (reverts to 0f at threshold 5). In order to do this, I use back references (\\1), and repeat these as many time as necessary to reach threshold. I need to then substr the result because annoyingly base R doesn't have an easy way to get just the captured sub expressions when using perl compatible regular expressions. There is probably a not too hard way to do this, but the substr approach works well in this example.


Also, as per the discussion in @G. Grothendieck's answer, here is the version with the cap on length of pattern, which is just adding the limit argument and the slight modification of the regexp.

find_rep_path <- function(vec, reps, limit) {
  regexp <- paste0(c("(.{1,", limit,"})", rep("\\1", reps - 1L)), collapse="")
  match <- regmatches(vec, regexpr(regexp, vec, perl=T))
  substr(match, 1, nchar(match) / reps)  
}
sapply(vec, find_rep_path, reps=3L, limit=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw      ff00f0f0f0f0f0f0f0f0000 
#                          "a"                         "0f" 
BrodieG
  • 51,669
  • 9
  • 93
  • 146
  • +1 really cool solution. Clever use of regexing too. This seems highly efficient (to my mind at least). Nice one. – Simon O'Hanlon Jan 09 '14 at 14:28
  • Side comment, `find_rep_path` code is actually vectorized already, so you don't need to use sapply, but if you do that your result vector may be a different length than your input when there aren't matches. With sapply you get a list with all the names and matches as in the second example. – BrodieG Jan 09 '14 at 14:48
  • @BrodieG I am trying to get patterns allowing jitter of 1 character. Example: for the string "a0cc0vaaaabaaadbaaabbaa00bvw" the pattern should come out to be "aaajb" where "j" can be anything. Can u suggest a modification of the above mentioned code for pattern finding, that could allow such jitters? Expecting a reply Thanks :) – phoenix Mar 18 '14 at 10:44
10

find.string finds substring of maximum length subject to (1) substring must be repeated consecutively at least th times and (2) substring length must be no longer than len.

reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times

find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
    for(k in len:1) {
        pat <- paste0("(.{", k, "})", reps("\\1", th-1))
        r <- regexpr(pat, string, perl = TRUE)
        if (attr(r, "capture.length") > 0) break
    }
    if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}

and here are some tests. The last test processes the entire text of James Joyce's Ulysses in 1.4 seconds on my laptop:

> find.string("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
> 
> joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt") 
> joycec <- paste(joyce, collapse = " ") 
> system.time(result <- find.string2(joycec, len = 25))

   user  system elapsed 
   1.36    0.00    1.39 
> result
[1] " Hoopsa boyaboy hoopsa!"

ADDED

Although I developed my answer before having seen BrodieG's, as he points out they are very similar to each other. I have added some features of his to the above to get the solution below and tried the tests again. Unfortunately when I added the variation of his code the James Joyce example no longer works although it does work on the other two examples shown. The problem seems to be in adding the len constraint to the code and may represent a fundamental advantage of the code above (i.e. it can handle such a constraint and such constraints may be essential for very long strings).

find.string2 <- function(string, th = 3, len = floor(nchar(string)/th)) {
    pat <- paste0(c("(.", "{1,", len, "})", rep("\\1", th-1)), collapse = "")
    r <- regexpr(pat, string, perl = TRUE)
    ifelse(r > 0, substring(string, r, r + attr(r, "capture.length")-1), "")
}

> find.string2("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string2("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"

> system.time(result <- find.string2(joycec, len = 25))
   user  system elapsed 
      0       0       0 
> result
[1] "w"

REVISED The James Joyce test that was supposed to be testing find.string2 was actually using find.string. This is now fixed.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • You are adding an important constraint in your Joyce test (`len`=25) that isn't specified by the OP. Otherwise the heart of your code (`regexpr` with a repeated `\\1` backreference) is basically exactly the same as mine. – BrodieG Jan 09 '14 at 15:05
  • The constraint seems important to prevent potentially quadratic behavior. By default the constraint is set to be large enough that it is not active but in tests like the James Joyce one a reasonable estimate of an upper bound could be made and seems an advantage of this approach. – G. Grothendieck Jan 09 '14 at 15:38
  • 1
    +1 I think this is a good answer too and it is different enough to stand in its own right (especially if developed without knowledge of prior answers, which is something I frequently encounter when working on a longer problem such as this one - you finish, post your answer only to then find someone else already posted something similar). – Simon O'Hanlon Jan 09 '14 at 15:53
  • @Grothendieck: Thanks for the wonderful solution. I did not completely understand it though. Can you please explain this line of code: "pat <- paste0("(.{", k, "})", reps("\\1", th-1))". – phoenix Jan 13 '14 at 16:50
  • 1
    Suppose `k <- 5; th <- 3`. Then that line generates the regular expression: `"(.{5})\\1\\1"` which matches a 5 character string three times. – G. Grothendieck Jan 13 '14 at 16:56
  • Great! And what is the time complexity of regexpr? Is it polynomial? – phoenix Jan 13 '14 at 17:22
  • For the regular expression shown each `regexpr` call should be linear in the input. – G. Grothendieck Jan 13 '14 at 17:44
  • @G.Grothendieck : I am trying to get patterns allowing jitter of 1 character. Example: for the string "a0cc0vaaaabaaadbaaabbaa00bvw" the pattern should come out to be "aaajb" where "j" can be anything. Can u suggest a modification of the above mentioned code for pattern finding, that could allow such jitters? Expecting a reply Thanks :) – phoenix Mar 18 '14 at 10:29
2

Not optimized (even it is fast) function , but I think it is more R way to do this.

  1. Get all patterns of certains length > threshold : vectorized using mapply and substr
  2. Get the occurrence of these patterns and extract the one with maximum occurrence : vectorized using str_locate_all.
  3. Repeat 1-2 this for all lengths and tkae the one with maximum occurrence.

Here my code. I am creating 2 functions ( steps 1-2) and step 3:

library(stringr)
ss = "ff00f0f0f0f0f0f0f0f0000" 
ss <- "a0cc0vaaaabaaaabaaaabaa00bvw"
find_pattern_length <- 
function(length=1,ss){
  patt = mapply(function(x,y) substr(ss,x,y),
                1:(nchar(ss)-length),
                (length+1):nchar(ss))
  res = str_locate_all(ss,unique(patt))
  ll = unlist(lapply(res,length))
  list(patt = patt[which.max(ll)],
       rep = max(ll))
}

get_pattern_threshold <- 
function(ss,threshold =3 ){
  res <- 
  sapply(seq(threshold,nchar(ss)),find_pattern_length,ss=ss)
  res[,which.max(res['rep',])]
}

some tests:

get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',5)
$patt
[1] "0f0f0"

$rep
[1] 6

> get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',2)
$patt
[1] "f0"

$rep
[1] 18
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • can you please explain your code? And according to question posted the answer should be just "aaaab" not this whole list of answers. – phoenix Jan 09 '14 at 13:04
  • I think you want to get the maximum length string that repeats at least the threshold times. The important bit is the length of the string, not the number of repeats. As long as the string repeats 3 or more times it is included. The winner is then the string with the longest length (if I understand correctly, which is no guarantee!) – Simon O'Hanlon Jan 09 '14 at 13:49
1

Since you want at least three repetitions, there is a nice O(n^2) approach.

For each possible pattern length d cut string into parts of length d. In case of d=5 it would be:

a0cc0
vaaaa
baaaa
baaaa
baa00
bvw

Now look at each pairs of subsequent strings A[k] and A[k+1]. If they are equal then there is a pattern of at least two repetitions. Then go further (k+2, k+3) and so on. Finally you also check if suffix of A[k-1] and prefix of A[k+n] fit (where k+n is the first string that doesn't match).

Repeat it for each d starting from some upper bound (at most n/3).

You have n/3 possible lengths, then n/d strings of length d to check for each d. It should give complexity O(n (n/d) d)= O(n^2).

Maybe not optimal but I found this cutting idea quite neat ;)

Łukasz Kidziński
  • 1,613
  • 11
  • 20
1

For a bounded pattern (i.e not huge) it's best I think to just create all possible substrings first and then count them. This is if the sub-patterns can overlap. If not change the step fun in the loop.

pat="a0cc0vaaaabaaaabaaaabaa00bvw"
len=nchar(pat)
thr=3
reps=floor(len/2)

# all poss strings up to half length of pattern
library(stringr)
pat=str_split(pat, "")[[1]][-1]
str.vec=vector()
for(win in 2:reps)
 {
     str.vec= c(str.vec, rollapply(data=pat,width=win,FUN=paste0, collapse=""))
 }

# the max length string repeated more than 3 times
tbl=table(str.vec)
tbl=tbl[tbl>=3]
tbl[which.max(nchar(names(tbl)))]

aaaabaa 
      3 

NB Whilst I'm lazy and append/grow the str.vec here in a loop, for a larger problem I'm pretty sure the actual length of str.vec is predetermined by the length of the pattern if you care to work it out.

Stephen Henderson
  • 6,340
  • 3
  • 27
  • 33
0

Here is my solution, it's not optimized (build vector with patterns <- c() ; pattern <- c(patterns, x) for example) and can be improve but simpler than yours, I think.

I can't understand which pattern exactly should (I just return the max) be returned but you can adjust the code to what you want exactly.

str <- "a0cc0vaaaabaaaabaaaabaa00bvw"

findPatternMax <- function(str){

  nb <- nchar(str):1
  length.patt <- rev(nb)
  patterns <- c()

  for (i in 1:length(nb)){
    for (j in 1:nb[i]){
      patterns <- c(patterns, substr(str, j, j+(length.patt[i]-1)))
    }
  }
  patt.max <- names(which(table(patterns) == max(table(patterns))))
  return(patt.max)
}


  findPatternMax(str)

  > findPatternMax(str)
  [1] "a"

EDIT : Maybe you want the returned pattern have a min length ?

then you can add a nchar.patt parameter for example :

nchar.patt <- 2 #For a pattern of 2 char min

nb <- nb[length.patt >= nchar.patt]

length.patt <- length.patt[length.patt >= nchar.patt]
Julien Navarre
  • 7,653
  • 3
  • 42
  • 69