22

I'm looking for a way, either in Ruby or Javascript, that will give me all matches, possibly overlapping, within a string against a regexp.


Let's say I have str = "abcadc", and I want to find occurrences of a followed by any number of characters, followed by c. The result I'm looking for is ["abc", "adc", "abcadc"]. Any ideas on how I can accomplish this?

str.scan(/a.*c/) will give me ["abcadc"], str.scan(/(?=(a.*c))/).flatten will give me ["abcadc", "adc"].

ndnenkov
  • 35,425
  • 9
  • 72
  • 104
transient_panda
  • 235
  • 2
  • 5
  • 3
    Readers: the first sentence makes it clear that what is wanted is a method invoked `all_matches(string, regex)`, which works with arbitrary strings and regexes. – Cary Swoveland Jan 23 '16 at 20:32
  • 1
    Generally with regex engines, the matching is "leftmost longest" unless you specify a non-greedy quantifier, where you get "leftmost shortest" match. You can't really expect to get both the shortest and the longest in a single expression. Getting all the shortest and then finding all the permutations of concatenations will be the best strategy. – glenn jackman Jan 23 '16 at 20:48
  • @theTinMan, the original question stated that the problem is about matches given a regex and that this was only an example with the words "lets say". After your edit, this specific regex match looks like the main point of the question. I disagree with your edit. – ndnenkov Jan 23 '16 at 22:32
  • Feel free to edit it. That's how SO works. "Let's say" is imaginary though. We're dealing with programming and with creating a reference book on SO, so conjecture is rarely appropriate. – the Tin Man Jan 23 '16 at 22:36

8 Answers8

11
def matching_substrings(string, regex)
  string.size.times.each_with_object([]) do |start_index, maching_substrings|
    start_index.upto(string.size.pred) do |end_index|
      substring = string[start_index..end_index]
      maching_substrings.push(substring) if substring =~ /^#{regex}$/
    end
  end
end

matching_substrings('abcadc', /a.*c/) # => ["abc", "abcadc", "adc"]
matching_substrings('foobarfoo', /(\w+).*\1/) 
  # => ["foobarf",
  #     "foobarfo",
  #     "foobarfoo",
  #     "oo",
  #     "oobarfo",
  #     "oobarfoo",
  #     "obarfo",
  #     "obarfoo",
  #     "oo"]
matching_substrings('why is this downvoted?', /why.*/)
  # => ["why",
  #     "why ",
  #     "why i",
  #     "why is",
  #     "why is ",
  #     "why is t",
  #     "why is th",
  #     "why is thi",
  #     "why is this",
  #     "why is this ",
  #     "why is this d",
  #     "why is this do",
  #     "why is this dow",
  #     "why is this down",
  #     "why is this downv",
  #     "why is this downvo",
  #     "why is this downvot",
  #     "why is this downvote",
  #     "why is this downvoted",
  #     "why is this downvoted?"]
ndnenkov
  • 35,425
  • 9
  • 72
  • 104
  • @mudasobwa, yours doesn't answer the question (aka given regex, get the substrings that match it). I had the same problem with my initial solution. – ndnenkov Jan 23 '16 at 14:36
  • 1
    I obviously did not downote, but your objection is silly: you provided the codepiece, as I did, not just a magic regexp. The magic regexp for that case is not existing for obvious reason: the problem can’t be solved with a simple state machine. – Aleksei Matiushkin Jan 23 '16 at 14:47
  • 1
    @mudasobwa, what do you mean? Assuming there are no lookarounds in the regex, my solution works for random regexes. Even with lookarounds, the problem is solvable. My objection was that your solution can't parameterise the regex. – ndnenkov Jan 23 '16 at 14:49
  • 1
    I suggest that you remove/replace the *"Why is this downvoted?"* example as there's a strong possibility that it will only attract more downvotes. While this code snippet may solve the question, [including an explanation](http://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – vaultah Jan 24 '16 at 07:04
11

In Ruby you could achieve the expected result using:

str = "abcadc"
[/(a[^c]*c)/, /(a.*c)/].flat_map{ |pattern| str.scan(pattern) }.reduce(:+)
# => ["abc", "adc", "abcadc"]

Whether this way works for you is highly dependent on what you really want to achieve.

I tried to put this into a single expression but I couldn't make it work. I would really like to know if there is some scientific reason this cannot be parsed by regular expressions or if I just don't know enough about Ruby's parser Oniguruma to do it.

aef
  • 4,498
  • 7
  • 26
  • 44
  • 4
    Assuming that OP's string and regex were just an example, this doesn't give a generic answer to the question. – ndnenkov Jan 23 '16 at 14:42
  • 1
    If that is so, please give an example that doesn't work. – aef Jan 23 '16 at 14:43
  • 1
    What if the question was about matching `/b.*d/` instead? Or about `/x.*y.*z.*[^m]*foo/`? – ndnenkov Jan 23 '16 at 14:45
  • 2
    The solution could easily be adapted to your first example. For the second one, you might be correct. I wouldn't exactly know how to adapt to it. That's kind of why I wrote the sentence saying it depends on what OP was exacty trying to achieve. – aef Jan 23 '16 at 14:52
  • btw, this is not expandable to the case when “to” marker is not 1 symbol, while mine is. – Aleksei Matiushkin Jan 23 '16 at 15:02
  • You are correct. To fix this, I would need a replacement for `[^c]` which works on strings and not only characters. Sadly, the negative look-ahead `(?!cx)` doesn't seem to do the trick. Any ideas? – aef Jan 23 '16 at 15:13
  • 2
    @WilliamFeng What's [expected result](http://ideone.com/52UNhp) in `abcadcdc` should'nt it include `abcadc`, `adcdc`? – bobble bubble Jan 23 '16 at 16:21
8

In JS:

function extract_ov_matches(r, s) {
  var res = [], cur;
  r = RegExp('^(?:' + r.source + ')$', r.toString().replace(/^[\s\S]*\/(\w*)$/, '$1').replace('g',''));
  for (var q = 0; q < s.length; ++q)
    for (var w = q; w <= s.length; ++w)
      if (r.test(cur = s.substring(q, w)))
        res.push(cur);
  return res;
}
document.body.innerHTML += "<pre>" + JSON.stringify(extract_ov_matches( /a.*c/g, 'abcadc' ), 0, 4) + "</pre>";

The point here is that you need to create all possible permutations of the input string and then return those that FULLY match the supplied pattern.

Overview of the extract_ov_matches function

  • r is the supplied regex (a compiled regex object, with flags)
  • s is the input string
  • RegExp('^(?:' + r.source + ')$', r.toString().replace(/^[\s\S]*\/(\w*)$/, '$1').replace('g','')); recreates the regex supplied with anchors (^ for start of string and $ for end of string) to match the whole string and the g flag is removed (because the regex will be used with RegExp#test)
  • for (var q = 0; q < s.length; ++q) for (var w = q; w <= s.length; ++w) are used to create all input string permutations
  • if (r.test(cur = s.substring(q, w))) res.push(cur);: if the current permutation fully matches the pattern, it is added to res, that will get returned eventually.
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
8

You want all possible matches, including overlapping ones. As you've noted, the lookahead trick from "How to find overlapping matches with a regexp?" doesn't work for your case.

The only thing I can think of that will work in the general case is to generate all of the possible substrings of the string and check each one against an anchored version of the regex. This is brute-force, but it works.

Ruby:

def all_matches(str, regex)
  (n = str.length).times.reduce([]) do |subs, i|
     subs += [*i..n].map { |j| str[i,j-i] }
  end.uniq.grep /^#{regex}$/
end

all_matches("abcadc", /a.*c/) 
#=> ["abc", "abcadc", "adc"]

Javascript:

function allMatches(str, regex) {
  var i, j, len = str.length, subs={};
  var anchored = new RegExp('^' + regex.source + '$');
  for (i=0; i<len; ++i) {
    for (j=i; j<=len; ++j) {
       subs[str.slice(i,j)] = true;
    }
  }
  return Object.keys(subs).filter(function(s) { return s.match(anchored); });
}
Community
  • 1
  • 1
Mark Reed
  • 91,912
  • 16
  • 138
  • 175
5
▶ str = "abcadc"
▶ from = str.split(/(?=\p{L})/).map.with_index { |c, i| i if c == 'a' }.compact
▶ to   = str.split(/(?=\p{L})/).map.with_index { |c, i| i if c == 'c' }.compact
▶ from.product(to).select { |f,t| f < t }.map { |f,t| str[f..t] }
#⇒ [
#  [0] "abc",
#  [1] "abcadc",
#  [2] "adc"
# ]

I believe, that there is a fancy way to find all indices of a character in a string, but I was unable to find it :( Any ideas?

Splitting on “unicode char boundary” makes it to work with strings like 'ábĉ' or 'Üve Østergaard'.

For more generic solution, that accepts any “from” and “to” sequences, one should introduce just a little modification: find all indices of “from” and “to” in the string.

the Tin Man
  • 158,662
  • 42
  • 215
  • 303
Aleksei Matiushkin
  • 119,336
  • 10
  • 100
  • 160
  • @ndn I can’t, and thanks for pointing me out that I can’t `split(//)` as well. Try it on `'ábĉ'`. – Aleksei Matiushkin Jan 23 '16 at 14:28
  • 1
    Assuming that OP's string and regex were just an example, this doesn't give a generic answer to the question. – ndnenkov Jan 23 '16 at 14:43
  • In ruby 2, instead of the split method you can use: `from = str.chars.to_a.map.with_index { |c, i| i if c == 'a' }.compact` – Casimir et Hippolyte Jan 23 '16 at 14:47
  • 1
    @CasimiretHippolyte I can not use `chars` because of combined diacritics. – Aleksei Matiushkin Jan 23 '16 at 14:49
  • @ndn it does for 1-symbol delimiters. – Aleksei Matiushkin Jan 23 '16 at 14:50
  • It isn't necessary, or particularly desirable, to use "Edit:" or "Update" markers in questions or answers. Remember, SO is a reference book, not a discussion forum. We're writing long-term solutions to programming problems, like an encyclopedia or cookbook. When was the last time you saw "Update:" or "Edit" in those resources? – the Tin Man Jan 23 '16 at 22:32
  • @theTinMan, I agree with you regarding answers, for the reason you gave. I also agree that a silent edit is appropriate for questions, provided that no answers or comments have been posted or if the edit is consistent with all answers and comments. However, we've all seen edits to questions that would have rendered comments puzzling or answers incorrect if an "Edit:" qualifier had not been present. (Your analogy does not apply here.) In the same vein, btw, I'd prefer that attributions for suggested improvements to answers be confined to comments. – Cary Swoveland Jan 24 '16 at 04:21
5

Here's an approach that is similar to @ndn's and @Mark's that works with any string and regex. I've implemented this as a method of String because that's where I'd like to see it. Wouldn't it be a great compliment to String#[] and String#scan?

class String
  def all_matches(regex)
    return [] if empty?
    r = /^#{regex}$/
    1.upto(size).with_object([]) { |i,a|
      a.concat(each_char.each_cons(i).map(&:join).select { |s| s =~ r }) }
  end
end

'abcadc'.all_matches /a.*c/
  # => ["abc", "abcadc", "adc"]
'aaabaaa'.all_matches(/a.*a/)
  #=> ["aa", "aa", "aa", "aa", "aaa", "aba", "aaa", "aaba", "abaa", "aaaba",
  #    "aabaa", "abaaa", "aaabaa", "aabaaa", "aaabaaa"] 
Cary Swoveland
  • 106,649
  • 6
  • 63
  • 100
4

Approach of RegExp /(a.c)|(a.*c)/g is to match "a" character followed by any character followed by "c" ; "a.*c" is to match "a" followed by any character followed by preceding character followed by "c" character ; note RegExp at (a.*c) could probably be improved. Condition at if checks if last character in input string is "c" , if true , push full input string to res results array

var str = "abcadc"
, res = str.match(/(a.c)|(a.*c)/g); 
if (str[str.length - 1] === "c") res.push(str);

document.body.textContent = res.join(" ")
guest271314
  • 1
  • 15
  • 104
  • 177
1

This JavaScript approach offers an advantage over Wiktor's answer by lazily iterating the substrings of a given string using a generator function, which allows you to consume a single match at a time for very large input strings using a for...of loop, rather than generating a whole array of matches at once, which could lead to out-of-memory exceptions since the amount of substrings for a string grows quadratically with length:

function * substrings (str) {
  for (let length = 1; length <= str.length; length++) {
    for (let i = 0; i <= str.length - length; i++) {
      yield str.slice(i, i + length);
    }
  }
}

function * matchSubstrings (str, re) {
  const subre = new RegExp(`^${re.source}$`, re.flags);
  
  for (const substr of substrings(str)) {
    if (subre.test(substr)) yield substr;
  }
}

for (const match of matchSubstrings('abcabc', /a.*c/)) {
  console.log(match);
}
Patrick Roberts
  • 49,224
  • 10
  • 102
  • 153