
I'm stuck on a RegEx problem that's seemingly very simple and yet I can't get it working.

Suppose I have input like this:

Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%
Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%
Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%

The input contains many repeating blocks. In each block I want to capture some things that are always there (%interestingbit% and %anotherinterestingbit%), but there is also a bit of text that may or may not occur in between them (OPTIONAL_THING), and I want to capture it if it's there.

A RegEx like this matches only the blocks that contain OPTIONAL_THING (and the named capture works):

%interestingbit%.+?((?<OptionalCapture>OPTIONAL_THING)).+?%anotherinterestingbit%

So it seems like it's just a matter of making the whole group optional, right? That's what I tried:

%interestingbit%.+?((?<OptionalCapture>OPTIONAL_THING))?.+?%anotherinterestingbit%

But I find that although this matches all 3 blocks, the named capture (OptionalCapture) is empty in all of them! How do I get this to work?

Note that there can be a lot of text within each block, including newlines, which is why I put in ".+?" rather than something more specific. I'm using .NET regular expressions, testing with The Regulator.
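
For anyone who wants to reproduce the symptom outside The Regulator, here is a minimal sketch in Python. Using Python's `re` module is an assumption on my part (the question is about .NET), so the named group is spelled `(?P<name>...)` instead of `(?<name>...)`, and `re.DOTALL` stands in for what .NET calls `RegexOptions.Singleline`. The helper name `captures` is made up for illustration.

```python
import re

# Sample input from the question: three blocks, only the middle one
# contains OPTIONAL_THING.
TEXT = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
)

# The question's second pattern, with the named group in Python syntax
# and the redundant outer parentheses dropped.
OPTIONAL_GROUP = re.compile(
    r"%interestingbit%.+?(?P<OptionalCapture>OPTIONAL_THING)?.+?%anotherinterestingbit%",
    re.DOTALL,
)


def captures(pattern, text):
    """Return the OptionalCapture value of every match (None if the group did not take part)."""
    return [m.group("OptionalCapture") for m in pattern.finditer(text)]


print(captures(OPTIONAL_GROUP, TEXT))  # -> [None, None, None]
```

All three blocks match, but the group never participates, which is the behaviour described above.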

EMP

3 Answers


My thoughts are along similar lines to Niko's idea. However, I would suggest placing the 2nd .+? inside the optional group instead of the first, as follows:

%interestingbit%.+?(?:(?<optionalCapture>OPTIONAL_THING).+?)?%anotherinterestingbit%

This avoids unnecessary backtracking. If the first .+? is inside the optional group and OPTIONAL_THING does not exist in the search string, the regex won't know this until it gets to the end of the string. It will then need to backtrack, perhaps quite a bit, to match %anotherinterestingbit%, which as you said will always exist.

Also, since OPTIONAL_THING, when it exists, will always come before %anotherinterestingbit%, the text after it is effectively optional as well and fits more naturally into the optional group.
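
A quick way to check the pattern above is a short Python sketch. Python is an assumption here rather than the asker's .NET setup, so the named group uses `(?P<name>...)` and `re.DOTALL` plays the role of .NET's Singleline option. The lazy-quantifier logic is the part being illustrated.

```python
import re

# Sample input from the question: only the middle block has OPTIONAL_THING.
TEXT = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
)

# The second lazy run sits inside the optional non-capturing group, so at
# every position the engine first tries OPTIONAL_THING, then the end marker,
# and only then lets the first .+? consume one more character.
FIXED = re.compile(
    r"%interestingbit%.+?(?:(?P<optionalCapture>OPTIONAL_THING).+?)?%anotherinterestingbit%",
    re.DOTALL,
)

matches = list(FIXED.finditer(TEXT))
results = [m.group("optionalCapture") for m in matches]
print(results)  # -> [None, 'OPTIONAL_THING', None]
```

Each match stays within its own block, and the capture is filled in only for the block that actually contains OPTIONAL_THING.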

Bryan

Why do you have the extra set of parentheses?

Try this:

%interestingbit%.+?(?<OptionalCapture>OPTIONAL_THING)?.+?%anotherinterestingbit%

Or maybe this will work:

%interestingbit%.+?(?<OptionalCapture>OPTIONAL_THING|).+?%anotherinterestingbit%

In this example, the group captures OPTIONAL_THING, or nothing.

strager
  • Nope, sorry, neither of these works. They're the same as my regex with the group being optional - all 3 blocks match, but without OPTIONAL_THING being captured. – EMP Jan 03 '09 at 03:01
  • @Evgeny, Are you sure .+? is making the wildcard "ungreedy?" Perhaps you can try .*? instead. – strager Jan 03 '09 at 03:06
  • @Evgeny, Do any of the regexes work as expected when you turn the named group into a non-named/numbered group? Also, another option is doing something like /(currently working regex here|regex without OPTIONAL_THING here)/. – strager Jan 03 '09 at 03:14
  • @strager, no, whether it's named or not makes no difference. The big | doesn't work either, because it produces 2 matches for the above input with the first match being from the start of the first block to the end of the second one. – EMP Jan 03 '09 at 03:30
  • To me the problem seems to be at the first non-greedy match pattern. You're in effect matching up to OPTIONAL_THING or nothing, so the first .+? instantly finds "nothing" and stops matching. Because OPTIONAL_THING doesn't come right after, the second .+? matches the rest of the input. Right..? – Niko Nyman Jan 03 '09 at 11:56
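
The reading in the last comment can be checked by wrapping each lazy run in its own group and looking at what it consumed. This is a sketch in Python, not .NET (an assumption, hence the `(?P<name>...)` syntax), and the group names `first` and `second` are made up for illustration.

```python
import re

# The one block from the question that does contain OPTIONAL_THING.
BLOCK = (
    "Some text %interestingbit% lots of random text OPTIONAL_THING "
    "lots and lots more %anotherinterestingbit%"
)

# Same shape as the failing patterns, with the two lazy runs made visible.
PROBE = re.compile(
    r"%interestingbit%(?P<first>.+?)(?P<OptionalCapture>OPTIONAL_THING)?"
    r"(?P<second>.+?)%anotherinterestingbit%",
    re.DOTALL,
)

m = PROBE.search(BLOCK)
print(repr(m.group("first")))            # the first .+? stops after one character
print(m.group("OptionalCapture"))        # the optional group matched nothing
print("OPTIONAL_THING" in m.group("second"))  # the second .+? swallowed it instead
```

The first `.+?` is satisfied after a single character, the optional group is allowed to match nothing at that position, and the second `.+?` then runs over OPTIONAL_THING on its way to the end marker.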

Try this:

%interestingbit%(?:(.+)(?<optionalCapture>OPTIONAL_THING))?(.+?)%anotherinterestingbit%

First there's a non-capturing group which matches .+OPTIONAL_THING or nothing. If a match is found, there's the named group inside, which captures OPTIONAL_THING for you. The rest is captured with .+?%anotherinterestingbit%.

[edit]: I added a couple of parentheses for additional capture groups, so now the captured groups match the following:

  • $1 : text before OPTIONAL_THING or nothing
  • ${optionalCapture} : OPTIONAL_THING or nothing (in .NET this is group 3 rather than 2, because named groups are numbered after the unnamed ones)
  • $2 : text after OPTIONAL_THING, or if OPTIONAL_THING is not found, the full text between %interestingbit% and %anotherinterestingbit%

Are these the three matches you're looking for?

Niko Nyman
  • Sorry, but this has the same issue as using one big "|" - the first match includes two blocks, so there are only 2 matches in total, not 3. – EMP Jan 03 '09 at 23:46
  • Oooops.. edited my answer before noticing there was a new answer ABOVE my answer. Another thing learned about Stack Overflow -- the answers are not in chronological order... – Niko Nyman Jan 05 '09 at 09:00
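
The two-matches behaviour EMP reports for this answer can be reproduced with a short Python sketch. Two assumptions: Python's `re` is used instead of .NET (so `(?P<name>...)` for the named group), and `re.DOTALL` is on because the question says blocks can contain newlines.

```python
import re

# Sample input from the question: only the middle block has OPTIONAL_THING.
TEXT = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
)

# Niko's pattern in Python syntax. The greedy (.+) runs to the end of the
# input and backtracks to the last OPTIONAL_THING it can find, even if that
# lies in a later block.
GREEDY = re.compile(
    r"%interestingbit%(?:(.+)(?P<optionalCapture>OPTIONAL_THING))?(.+?)%anotherinterestingbit%",
    re.DOTALL,
)

matches = list(GREEDY.finditer(TEXT))
print(len(matches))  # -> 2
# The first match contains two end markers, i.e. it spans blocks 1 and 2.
print(matches[0].group(0).count("%anotherinterestingbit%"))  # -> 2
```

With the dot allowed to cross newlines, the first match starts in block 1 and ends in block 2, which leaves only one further match for block 3.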