49

I can't seem to find an answer to this problem, and I'm wondering if one exists. Simplified example:

Consider a string "nnnn", where I want to find all matches of "nn" - but also those that overlap with each other. So the regex would provide the following 3 matches:

  1. [nn]nn
  2. n[nn]n
  3. nn[nn]

I realize this is not exactly what regexes are meant for, but walking the string and parsing this manually seems like an awful lot of code, considering that in reality the matches would have to be done using a pattern, not a literal string.

PhiLho
jevakallio
  • Thank you for adding this question. I was not even sure about how to state this problem in a way that other people could understand it! – Efe Zaladin Aug 01 '21 at 21:22

3 Answers

34

Update 2016:

To get nn, nn, nn, SDJMcHattie proposes in the comments (?=(nn)) (see regex101).

(?=(nn))
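As a minimal sketch of how this behaves in practice, here is the lookahead-with-capture idea in Python's `re` module (the choice of Python and the helper name `overlapping_matches` are assumptions for illustration, not part of the original answer). Each match is zero-width, so the engine advances one character at a time and the captured text is read from group 1.

```python
import re

def overlapping_matches(pattern, text):
    """Return (start, matched_text) for every position where pattern matches.

    Hypothetical helper: wraps the given pattern as (?=(pattern)) so that
    each match consumes nothing and overlapping hits are all reported.
    """
    lookahead = re.compile("(?=(" + pattern + "))")
    # group(1) is the outer capturing group added by the wrapper.
    return [(m.start(), m.group(1)) for m in lookahead.finditer(text)]

print(overlapping_matches("nn", "nnnn"))
# [(0, 'nn'), (1, 'nn'), (2, 'nn')]
print(overlapping_matches(r"\d\d", "foo4237bar"))
# [(3, '42'), (4, '23'), (5, '37')]
```

Because the pattern sits inside a single capturing group, the same wrapper should work for an arbitrary sub-pattern, not only a literal such as nn.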

Original answer (2008)

A possible solution could be to use a positive lookbehind:

(?<=n)n

It would give you the end position of:

  1. [nn]nn
  2. n[nn]n
  3. nn[nn]
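A quick check of the lookbehind form, sketched in Python's `re` module (the language is an assumption; the original answer names no engine). Each match is the final n of one overlapping pair, so `m.end()` is the end position of that pair.

```python
import re

# (?<=n)n matches any n that is directly preceded by another n.
ends = [m.end() for m in re.finditer(r"(?<=n)n", "nnnn")]
print(ends)  # [2, 3, 4]
```

Note that this only reports positions; the pair itself is not captured, which is one reason the lookahead variants are easier to generalise.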

As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example)

Rather than his proposed (?=nn)n, I would prefer the simpler form:

(n)(?=(n))

That would reference the first position of the strings you want and would capture the second n in group(2).

That is so because:

  • Any valid regular expression can be used inside the lookahead.
  • If it contains capturing parentheses, the backreferences will be saved.

So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).
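To illustrate with a pattern rather than a literal, here is a small Python sketch (Python's `re` module is an assumption; any engine with capturing lookaheads should behave similarly) using `\d` in place of n on the string foo4237bar:

```python
import re

# Group 1 consumes one digit; group 2 is captured inside the lookahead,
# so the following digit is recorded but not consumed.
pairs = re.findall(r"(\d)(?=(\d))", "foo4237bar")
print(pairs)  # [('4', '2'), ('2', '3'), ('3', '7')]
```

Since only one digit is consumed per match, the next attempt starts on the digit that was just captured by the lookahead, which is what produces the overlap.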


VonC
  • Also, you could have done it with a positive lookahead: (?=nn)n ... that says "while ahead is two N's, match an N". – Timothy Khouri Nov 26 '08 at 12:23
  • Excuse me, but I still don't see the requested three overlapping captures. You capture two n, but not three groups. If I match (\d\d)(?=(\d\d)) against foo4237bar, I get two captures, not three: 42 and 37 (in both Regex Coach and PCRE Workbench). I am probably thick, so I need more explanations. – PhiLho Nov 26 '08 at 22:47
  • Please read the answer again: it is (\d)(?=(\d)), not (\d\d)(?=(\d\d)). You will get 3 sets of capturing groups: (4)(2), (2)(3), (3)(7) – VonC Nov 26 '08 at 22:57
  • Why not just `(?=(nn))`? Then you get a single capture group for each match – SDJMcHattie Jan 19 '16 at 11:03
  • I may be missing something, but @SDJMcHattie comment looks like _the_ answer: it will work for an arbitrary expression, not just for a sequence of `n`s – Andrew Savinykh Nov 29 '21 at 00:30
  • @AndrewSavinykh Agreed. I must have missed this 2016 comment on my 2008 answer. I have included the comment in the answer. – VonC Nov 29 '21 at 07:00
29

Using a lookahead with a capturing group works, at the expense of making your regex slower and more complicated. An alternative solution is to tell the Regex.Match() method where the next match attempt should begin. Try this:

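using System.Text.RegularExpressions;

Regex regexObj = new Regex("nn");
Match matchObj = regexObj.Match(subjectString);
while (matchObj.Success) {
    // matchObj.Value is the matched text and matchObj.Index its start position.
    // Restart one character after the previous match's start so overlaps are found.
    matchObj = regexObj.Match(subjectString, matchObj.Index + 1);
}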
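For readers outside .NET, the same restart-at-offset loop can be sketched in Python, where a compiled pattern's `search` method accepts a start position. The function name `find_overlapping` is hypothetical, and this is an illustration of the approach rather than a translation endorsed by the answer.

```python
import re

def find_overlapping(pattern, text):
    """Collect (start, matched_text), retrying one character after each match start."""
    regex = re.compile(pattern)
    results = []
    match = regex.search(text)
    while match:
        results.append((match.start(), match.group()))
        # Resume the search one character past the start of the last match,
        # not past its end, so overlapping matches are not skipped.
        match = regex.search(text, match.start() + 1)
    return results

print(find_overlapping("nn", "nnnn"))
# [(0, 'nn'), (1, 'nn'), (2, 'nn')]
```

This reports at most one match per starting position, namely whichever one the engine prefers there.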
Jan Goyvaerts
  • Regular-Expressions.info webmaster... => mandatory + 1. Plus, you are right, of course. – VonC Nov 26 '08 at 19:28
2

AFAIK, there is no pure regex way to do that at once (i.e., returning the three captures you request without a loop).

You can, however, match the pattern once, then loop, restarting the search at offset (found position + 1) each time. That combines regex use with simple code.

[EDIT] Great, I am downvoted when I basically said what Jan showed...
[EDIT 2] To be clear: Jan's answer is better. Not more precise, but certainly more detailed, it deserves to be chosen. I just don't understand why mine is downvoted, since I still see nothing incorrect in it. Not a big deal, just annoying.

PhiLho
  • Beat me to it by 1 second, I'll withdraw my identical answer! – Simon Steele Nov 26 '08 at 11:58
  • @Timothy: that won't do the capture, and you still have to loop on the results, so I am not sure of the advantages... – PhiLho Nov 26 '08 at 13:42
  • @PhiLho: again, not true: you can capture group in a zero-width assertion like a positive look-ahead. See my - completed - answer. – VonC Nov 26 '08 at 14:23
  • @PhiLho: I responded to your comment. And, in my opinion, your answer was less precise than Jan's: "the pattern" could refer to 'n', whereas the correct strategy means using 'nn', then start again at offset+1. You may have meant that all along, you just did not explain it. – VonC Nov 26 '08 at 23:01
  • @VonC: the question is precise, the pattern has been "nn" all along, I don't see an ambiguity there. – PhiLho Nov 27 '08 at 01:08