1

In following code:

"a sasas b".match(/sas/g) //returns ["sas"]

The string actually include two sas strings, a [sas]as b and a sa[sas] b.

How can I modify RegEx to match both?

Another example:

"aaaa".match(/aa/g); //actually include [aa]aa,a[aa]a,aa[aa]

Please consider the issue in general not just above instances.

A pure RexEx solution is preferred.

Handsome Nerd
  • 17,114
  • 22
  • 95
  • 173
  • 2
    I believe you will find your answer in an older question: [Overlapping matches in Regex](http://stackoverflow.com/questions/320448/overlapping-matches-in-regex). – quentinxs Nov 09 '12 at 03:59

4 Answers4

2

If you want to match at least one such "merged" occurrence, then you could do something like:

"a sasas b".match(/s(as)+/g)

If you want to retrieve the matches as separate results, then you have a bit more work to do; this is not a case that regular expressions are designed to handle. The basic algorithm would be:

  • Attempt a match. If it was unsuccessful, stop.
  • Extract the match you are interested in and do whatever you want with it.
  • Take the substring of the original target string, starting from one character following the first character in your match.
  • Start over, using this substring as the new input.

(To be more efficient, you could match with an offset instead of using substrings; that technique is discussed in this question.)

For example, you would start with "a sasas b". After the first match, you have "sas". Taking the substring that starts one character after the match starts, we would have "asas b". The next match would find the "sas" here, and you would again repeat the process with "as b". This would fail to match, so you would be done.

Community
  • 1
  • 1
cdhowie
  • 158,093
  • 24
  • 286
  • 300
1

This significantly-improved answer owes itself to @EliGassert.

String.prototype.match_overlap = function(re)
    {
        if (!re.global)
            re = new RegExp(re.source,
                            'g' + (re.ignoreCase ? 'i' : '')
                                + (re.multiline  ? 'm' : ''));
        var matches = [];
        var result;
        while (result = re.exec(this))
            matches.push(result),
            re.lastIndex = result.index + 1;
        return matches.length ? matches : null;
    }

@EliGassert points out that there is no need to walk through the entire string character by character; instead we can find a match anywhere (i.e. do without the anchor), and then continue one character after the index of the found match. While researching how to retrieve said index, I found that the re.lastIndex property, used by exec to keep track of where it should continue its search, is in fact settable! This works rather nicely with what we intend to do.

The only bit needing further explanation might be the beginning. In the absence of the g flag, exec may never return null (always returning its one match, if it exists), thus possibly going into an infinite loop. Since, however, match_overlap by design seeks multiple matches, we can safely recompile any non-global RegExp as a global RegExp, importing the i and m options as well if set.

Here is a new jsFiddle: http://jsfiddle.net/acheong87/h5MR5/.

document.write("<pre>");
document.write('sasas'.match_overlap(/sas/));
document.write("\n");
document.write('aaaa'.match_overlap(/aa/));
document.write("\n");
document.write('my1name2is3pilchard'.match_overlap(/[a-z]{2}[0-9][a-z]{2}/));
document.write("</pre>");​

Output:

sas,sas
aa,aa,aa
my1na,me2is,is3pi
Andrew Cheong
  • 29,362
  • 15
  • 90
  • 145
0
var match = "a sasas b".match(/s(?=as)/g);

for(var i =0; i != match.length; ++i)
    alert(match[i]);

Going off of the comment by Q. Sheets and the response by cdhowie, I came up with the above solution: it consumes ONE character in the regular expression and does a lookahead for the rest of the match string. With these two pieces, you can construct all the positions and matching strings in your regular expression.

I wish there was an "inspect but don't consume" operator that you could use to actually include the rest of the matching (lookahead) string in the results, but there unfortunately isn't -- at least not in JS.

Eli Gassert
  • 9,745
  • 3
  • 30
  • 39
0

Here's a generic way to do it:

​String.prototype.match_overlap = function(regexp)
    {
        regexp = regexp.toString().replace(/^\/|\/$/g, '');
        var re = new RegExp('^' + regexp);
        var matches = [];
        var result;
        for (var i = 0; i < this.length; i++)
            if (result = re.exec(this.substr(i)))
                matches.push(result);
        return matches.length ? matches : null;
    }

Usage:

var results = 'sasas'.match_overlap(/sas/);

Returns:

  • An array of (overlapping) matches, or null.

Example:

Here's a jsFiddle in which this:

document.write("<pre>");​
document.write('sasas'.match_overlap(/sas/));
document.write("\n");
document.write('aaaa'.match_overlap(/aa/));
document.write("\n");
document.write('my1name2is3pilchard'.match_overlap(/[a-z]{2}[0-9][a-z]{2}/));
document.write("</pre>");​

returns this:

sas,sas
aa,aa,aa
my1na,me2is,is3pi

Explanation:

To explain a little bit, we intend for the user to pass a RegExp object to this new function, match_overlap, as he or she would do normally with match. From this we want to create a new RegExp object anchored at the beginning (to prevent duplicate overlapped matches—this part probably won't make sense unless you encounter the issue yourself—don't worry about it). Then, we simply match against each substring of the subject string this and push the results to an array, which is returned if non-empty (otherwise returning null). Note that if the user passes in an expression that is already anchored, this is inherently wrong—at first I stripped anchors out, but then I realized I was making an assumption in the user's stead, which we should avoid. Finally one could go further and somehow merge the resulting array of matches into a single match result resembling what would normally occur with the //g option; and one could go even further and make up a new flag, e.g. //o that gets parsed to do overlap-matching, but this is getting a little crazy.

Andrew Cheong
  • 29,362
  • 15
  • 90
  • 145
  • Correct me if I'm wrong, but you're walking the whole string every time. This could get VERY expensive on long strings. You'd be better off doing a search for a single instance and re-starting (setting `i`) to the first character's index (+1) of the last match. I see you're doing ^ on the match, so it'll at least short circuit quickly. I haven't done perf checks to know if this is truly an impact, but I at least wanted to bring it up. – Eli Gassert Nov 09 '12 at 20:26
  • @EliGassert - You are absolutely right. I posted a new solution, separately. I don't want to vote up your answer to this question because I feel it is not a _general_ solution. However, I'd like to show my thanks equivalently by looking through some of your past answers and voting up any that I find personally useful and/or interesting. – Andrew Cheong Nov 09 '12 at 21:23