reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times

find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {        
    for(sublen in len:1) 
    {
        for(inlen in 0:sublen) 
        {
            pat <- paste0("((.{", sublen-inlen, "})(.)(.{", inlen, "}))", reps("(\\2.\\4)", th-1))
            r <- regexpr(pat, string, perl = TRUE)
            if (attr(r, "capture.length")[1] > 0)
            {
                if (r > 0) 
                {
                    substring(string, r, r + attr(r, "capture.length")[1] - 1)
                }  
            }
        }             
    }             
}

Why doesn't this code work? It should accept an input string such as 110111111 and output every pattern that satisfies one constraint:

The pattern appears consecutively at least 3 times.

In addition, it should also output patterns that repeat with a jitter of 1 character, i.e. patterns like 110, which appears three times consecutively except for a mismatch at the last position. Instead, the function just returns NULL. Another example is a0cc0vaaaabaaadbaaabbaa00bvw, where one of the outputs should be aaaab.

Edit: the input can be a string containing characters or numbers. Also, the minimum length of a match should be at least 2. And yes, the matches overlap. Also, the input will be of the form:

`find.string("a0cc0vaaaabaaadbaaabbaa00bvw")` or `find.string("110111111")`
halfer
Qirohchan
  • The question could be improved by giving example of usage. For example '`find.strings("a0cc0vaaaabaaadbaaabbaa00bvw")` should return a character vector containing the string `"aaaab"`'. – Richie Cotton Sep 07 '14 at 10:53
  • It also isn't clear to me what the inputs can be. Is it always a single string? A character vector? Are the characters always lower case letters or number? Or are upper case letters and punctuation allowed? – Richie Cotton Sep 07 '14 at 10:55
  • And I don't know what the shortest match allowed is. For example, `11` appears many times in the first example string, and matches of length 1 appear in many positions. Can matches overlap? – Richie Cotton Sep 07 '14 at 10:57
  • Also, how many characters is your longest input string? For short strings, you could enumerate all possible matches, but this will quickly get very big. – Richie Cotton Sep 07 '14 at 10:59
  • You may find everything you want by using `rle(strsplit(input_string))` and selecting those elements for which `$length >=3` is true. – Carl Witthoft Sep 07 '14 at 12:07
  • @RichieCotton, the input can be a string containing characters or numbers. Also, the minimum length of a match should be at least 2. And yes, the matches overlap. Also, the input will be of the form `find.string("a0cc0vaaaabaaadbaaabbaa00bvw")` or `find.string("110111111")`. – Qirohchan Sep 07 '14 at 12:52
  • @RichieCotton, for example, while experimenting with this code, I earlier set a boolean variable `flag` to `TRUE` on finding the maximal-length match. However, that returned only one rule. E.g. for the input `101101101110110110`, it returned only `101`, but as you can clearly see, there are other matches as well, like `110`. – Qirohchan Sep 07 '14 at 12:57
  • Your question title is literally one of the exact close reasons provided to moderators. Try looking at [how to ask a good question](http://stackoverflow.com/help/how-to-ask) or [how to create a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It's important to edit the question with additional information people ask for; do not simply respond in comments - that doesn't improve the question itself. – MrFlick Sep 07 '14 at 14:18

1 Answer


I haven't looked in depth into the logic of your function, but there's an obvious reason why it sometimes returns NULL. If you don't explicitly use the return function, R functions will return the last expression that they evaluate.

In find.string, the last expression evaluated is the outer for loop, which finishes when sublen equals 1 (outer loop) and inlen equals sublen (inner loop). If attr(r, "capture.length")[1] > 0 and r > 0, the body computes substring(string, r, r + attr(r, "capture.length")[1] - 1), but that value is never stored or returned. If one of those conditions isn't satisfied, the if evaluates to NULL. Either way, a for loop itself always evaluates to NULL, and hence find.string returns NULL.

You can see how this works with a simpler example:

f <- function() if(FALSE) 1
print(f())
## NULL
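
The same holds for loops: in R a for loop evaluates to NULL (invisibly), whatever its body computes. A minimal sketch, where g is a hypothetical function used only for illustration:

```r
# The body evaluates i on every iteration, but the loop's own value is NULL.
g <- function() for (i in 1:3) i
print(g())
## NULL
```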

You need to store the results from each loop in a variable, and return that.
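
As a rough sketch of that suggestion (not a tested replacement for the original function), the loops from the question can append each match to a vector and return it as the last expression. The names found, start and caplen are introduced here for illustration, and the call to unique() is an assumption about the desired output rather than something stated in the question:

```r
# Helper from the question: repeat s n times.
reps <- function(s, n) paste(rep(s, n), collapse = "")

find.string <- function(string, th = 3, len = floor(nchar(string) / th)) {
    found <- character(0)  # accumulates every match across both loops
    for (sublen in len:1) {
        for (inlen in 0:sublen) {
            # Same pattern as in the question.
            pat <- paste0("((.{", sublen - inlen, "})(.)(.{", inlen, "}))",
                          reps("(\\2.\\4)", th - 1))
            r <- regexpr(pat, string, perl = TRUE)
            start <- as.integer(r)                     # -1 when there is no match
            caplen <- attr(r, "capture.length")[1]     # length of capture group 1
            if (start > 0 && caplen > 0) {
                found <- c(found, substring(string, start, start + caplen - 1))
            }
        }
    }
    unique(found)  # assumption: duplicates are not wanted
}

find.string("110111111")
```

Note that calling return() inside the loop would not achieve the same thing: it exits the function at the first match, so only one pattern would ever come back.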


A couple of other obvious code improvements:

  1. You can combine your if statements together using logical and.

    if (attr(r, "capture.length")[1] > 0 && r > 0)

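  2. regexpr is vectorised over its text argument but not over pattern (only the first element of pattern is used), so getting rid of that inner loop means building the patterns as a vector, e.g. with paste0, and applying regexpr to each one. That tidies the code, though it may not speed it up much.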
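
Both points can be sketched together for a single value of sublen. This is only an illustration under some assumptions: matches.for.sublen is a hypothetical helper that does not appear in the original code, and because regexpr accepts a single pattern, the sketch builds all patterns at once with paste0 (which is vectorised) and then applies regexpr to each with vapply.

```r
# Helper from the question: repeat s n times.
reps <- function(s, n) paste(rep(s, n), collapse = "")

# Hypothetical helper: all matches for one sublen, without an explicit inner loop.
matches.for.sublen <- function(string, sublen, th = 3) {
    inlen <- 0:sublen
    # paste0 recycles its arguments, giving one pattern per value of inlen.
    pats <- paste0("((.{", sublen - inlen, "})(.)(.{", inlen, "}))",
                   reps("(\\2.\\4)", th - 1))
    out <- vapply(pats, function(pat) {
        r <- regexpr(pat, string, perl = TRUE)
        start <- as.integer(r)
        caplen <- attr(r, "capture.length")[1]
        # Single combined condition using logical and.
        if (start > 0 && caplen > 0) {
            substring(string, start, start + caplen - 1)
        } else {
            NA_character_  # marks "no match" for this pattern
        }
    }, character(1), USE.NAMES = FALSE)
    out[!is.na(out)]
}

matches.for.sublen("110111111", sublen = 2)
```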

Richie Cotton
  • I added the `return` statement at the end, as `return(substring(string, r, r + attr(r, "capture.length")[1] - 1))`, but I am having a problem. Why does it return only one pattern? It should return all the patterns for which the condition in the if statement is satisfied. For example, after adding the return statement, try executing `find.string("101101101110110110")`. Here, it should return three patterns as far as I can see, i.e. `101`, `011` and `110`. – Qirohchan Sep 08 '14 at 13:32