5

I have run into a problem with the ruby regex. I need to find all (potentially overlapping) matches. This is a simplification of the problem:

#Simple example
"Hey".scan(/../)
=> ["He"] 
#Actual results

#With overlapping matches the result should be
=> ["He"], ["ey"]

The regex I am trying to execute and get all results for looks like this:

"aaaaaa".scan(/^(..+)\1+$/) #This looks for multiples of (here) "a" bigger than one that "fills" the entire string. "aa"*3 => true, "aaa"*2 => true. "aaaa"*1,5 => false.
 => [["aaa"]] 

#With overlapping results this should be
 => [["aa"],["aaa"]]

Is there a library or a way to do regex in ruby to get the results I am after?

I found some clues that this was possible in Perl, but after hours of research I did not find anything about a Ruby way of doing this.

However I was able to find this "Javascript Regex - Find all possible matches, even in already captured matches", but I am not able to find anything similar for Ruby, nor find something similar to the last index property in the Ruby version. To be honest I don't think that it would have worked anyways since the regex I intend to use is recursive and relies on the entire string, while that method chops away at the string.

Community
  • 1
  • 1
Automatico
  • 12,420
  • 9
  • 82
  • 110
  • So basically you want a permutation of a string ? – HamZa May 04 '13 at 23:34
  • No. It is just an example. The actual regex is a bit harder, but that is the issue that arrises. The scan method chops off the parts of the string it finds and continues. I need it to preserve the string so it can be viewed by the next possible match. – Automatico May 04 '13 at 23:36
  • I'm not sure, but the first and second examples are a bit contradicting. Following the first example, I would think that the second one should return `[aa, aa, aa, aa, aa...., aaa, aaa, aaa ..., aaaa, aaaa, aaaa ...., aaaaa, aaaaa, aaaaaa]` – HamZa May 04 '13 at 23:48
  • No. Thats not what that regex looks for. The answer should be `aa, aaa`. Those are the only matches that is possible. – Automatico May 04 '13 at 23:50
  • What makes you think another method will change the behavior or the regex? – squiguy May 04 '13 at 23:51
  • You want it recursive, so it should match /aa/aaaa, a/aa/aaa, aa/aa/aa, aaa/aa/a, aaaa/aa/, /aaa/aaa, a/aaa/aa and so on. – HamZa May 04 '13 at 23:52
  • 1
    There's a Perl thing for this, but it can blow up pretty spectacularly pretty quickly. I needed something similar and gave up. – Dave Newton May 04 '13 at 23:53
  • No. Look up what the \1+ regex does. What you explain is not what that regex looks for. it looks for `aa*aa*aa` and `aaa*aaa`, but catces only the first `aa` or `aaa`. This is by no means an simple question to answer. It is actually a really hard question which I have spent hours researching not finding any evidence of it being possible with ruby. I know there are some perl scripts that are capable of doing this, but I need it to be ruby. – Automatico May 04 '13 at 23:55
  • Well I know it's hard, the only thing I could think of is a (normal) loop with an offset. – HamZa May 05 '13 at 00:02
  • @HamZaDzCyberDeV Apparently you don't see how hard it is. That can't be done. You can't tamper with the string. That is the whole point. The next results are dependent of the same string as the first. – Automatico May 05 '13 at 00:05
  • @DaveNewton Did you ever use the perl thing? Link? Maybe I can run a perl script from ruby and hack it that way. – Automatico May 05 '13 at 00:05
  • @Cort3z Sorry but if you are sure that it can't be done then why are you asking for it ? And the earlier suggestion about loops is just to say to create a custom "mini-parser". Can you edit the "aaaaaa" part and provide the expected output with "abcdef" that will make it more clearer. – HamZa May 05 '13 at 00:08
  • @HamZaDzCyberDeV Then make the loop and answer the question. I can not edit the "aaaaaa" part, because that is one of the few things that triggers that regex. – Automatico May 05 '13 at 00:13
  • @Cort3z Quick demo in [PHP](http://codepad.viper-7.com/BqDuR6), I'm not a ruby coder :) – HamZa May 05 '13 at 00:31
  • @Cort3z IIRC I ended up looking at the solution in HOP; let me search. – Dave Newton May 05 '13 at 00:34
  • @Cort3z Yes; see [HOP 6.5](http://hop.perl.plover.com/book/pdf/06InfiniteStreams.pdf). It didn't work in my particular situation, but may for you. – Dave Newton May 05 '13 at 00:36
  • @Cort3z Although, re-reading your question, I guess that isn't what you want. – Dave Newton May 05 '13 at 00:38
  • @HamZaDzCyberDeV Unfortunately the regex fails if you put `aaaaaaaa`. Then it should result in `aa` and `aaaa`. – Automatico May 05 '13 at 00:39
  • @Cort3z Then that means that I don't understand the "logic" behind this ... – HamZa May 05 '13 at 00:40
  • Updated question to hopefully be less ambiguous. – Automatico May 05 '13 at 09:28
  • Have you tried lookaheads? http://www.regular-expressions.info/lookaround.html – Akash May 05 '13 at 18:20

5 Answers5

6

Kind of old topic... Not sure if I understand, but best I can find is this:

"Hey".scan(/(?=(..))/)
 => [["He"], ["ey"]] 

"aaaaaa".scan(/(?=(..+)\1)/)
 => [["aaa"], ["aa"], ["aa"]] 

scan walks thru every byte and the "positive look-ahead" (?=) tests the regexp (..+)\1 in every step. Look-aheads don't consume bytes, but the capture group inside it returns the match if it exists.

Fernando Fabreti
  • 4,277
  • 3
  • 32
  • 33
3

Are you just missing the second capture group?

"aaaaaa".scan(/(..+?)(\1+)/)
#=> [["aa", "aaaa"]]

It seems like there might be something wrong with your expectation.

pguardiario
  • 53,827
  • 19
  • 119
  • 159
  • You know what. I just realised I had a small mistake. – Automatico May 05 '13 at 01:00
  • But the solutions suggested still did not work with the original problem. That regex should trigger on `aaa` as well. – Automatico May 05 '13 at 01:03
  • 5
    I can't tell which problem you mean but the `\1+` will never match `aaa` (if `\1` is `aa`), only multiples of `aa`. – pguardiario May 05 '13 at 01:08
  • @Cort3z the regexp should not capture `aaa` as the non-greedy modifier `?` dictates it will be satisfied as soon as it sees `aa`. – dbenhur May 05 '13 at 01:47
  • @dbenhur If that is the case, then why would it find `aaaa`? I might have misunderstood the regex, that is completely possible. But as I understood it, the regex does this: "find zero or one instance of two or more characters that repeat". As mentioned too, I initially messed messed up the more advance regex is the question. However, the original question `"Hey".scan(/../) => "He", "ey"` is stil sort of valid. – Automatico May 05 '13 at 09:00
  • It feels a bit like you're asking 2 separate questions here. Maybe you should consider splitting them up. – pguardiario May 05 '13 at 10:39
  • 1
    @Cort3z the "aa" is teh capture of `(..+?)`, the "aaaa" is the capture of `(\1+)`. The first is non-greedy, the second greedy. – dbenhur May 05 '13 at 15:59
3

The problem with any solution based on scan is it won't find overlapping matches as scan always makes forward progress. It might be possible to recast the regexp so it's entirely embedded in a zero-width positive lookahead and then use scan, but IIRC there are otherwise valid regexp patterns that do not work in lookahead or lookbehind.

There's some ambiguity in the question as posed. This interprets the question as really asking to find all the unique matching substrings of a target string for which a regexp will match. Though not strictly necessary it uses ruby 2.0 lazy evaluation to avoid excessive intermediate array allocations.

class String
  def each_substring
    Enumerator.new do |y|
      (0...length).each do |b|
        (b...length).each do |e|
          y << self[b..e]
        end
      end
      y << '' 
    end
  end
end

class Regexp
  def all_possible_matches(str)
    str.each_substring.lazy.
    map { |s| match(s) }.
    reject(&:nil?).
    map { |m| m.size > 1 ? m[1..-1] : m[0] }.
    to_a.uniq
  end
end

/.{2,4}/.all_possible_matches('abcde')
=> ["ab", "abc", "abcd", "bc", "bcd", "bcde", "cd", "cde", "de"]

/^(..+?)\1+$/.all_possible_matches('aaaaaa')
=> [["aa"]]
/^(..+)\1+$/.all_possible_matches('aaaaaa')
=> [["aa"], ["aaa"]]
/^(..+?)\1+$/.all_possible_matches('aaaaaaaaa')
=> [["aa"], ["aaa"]]
/^(..+)\1+$/.all_possible_matches('aaaaaaaaa')
=> [["aa"], ["aaa"], ["aaaa"]]

EDIT: made it return capture groups when present. The OP's desired solution to the non-greedy form of /^(..+?)\1+$/ is wrong as the ? means it will be satisfied with the pattern with the fewest chars.

the Tin Man
  • 158,662
  • 42
  • 215
  • 303
dbenhur
  • 20,008
  • 4
  • 48
  • 45
  • I get `undefined method 'lazy' for # – Automatico May 05 '13 at 09:36
  • @Cort3z As I stated in the answer, [`lazy`](http://ruby-doc.org/core-2.0/Enumerable.html#method-i-lazy) is a feature of ruby 2.0. In 1.9 you can just omit it and it should work fine, just produce more intermediate results. – dbenhur May 05 '13 at 16:02
1

I don't understand why your expected results should be like that, but for just applying the regex from different starting points, this will do.

class String
  def awesome_regex_scan r
    (0...length).map{|i| match(r, i)}.map(&:to_a).reject(&:empty?).uniq
  end
end

"Hey".awesome_regex_scan(/../) # => [["He"], ["ey"]]

As written above, it does not match your expected result, and I don't understand why you expect what you do:

"aaaaaa".awesome_regex_scan(/^(..+?)\1+$/) # => [["aaaaaa", "aa"]]
"aaaaaa".awesome_regex_scan(/^(..+)\1+$/) # => [["aaaaaa", "aaa"]]
sawa
  • 165,429
  • 45
  • 277
  • 381
0
class String
  def awesome_regex_scan(pattern)
    result = []
    source = self
    while (match = source.match(pattern))
      result << match.to_s
      source = source.slice(match.begin(0)+1..-1)
    end
    result
  end
end

p "Hey".awesome_regex_scan(/../)
Sam Ruby
  • 4,270
  • 22
  • 21