2

I have to find occurrences of a certain string (needle) within another string (haystack) that don't occur between specific "braces".

For example, consider this haystack: "BEGIN something END some other thing BEGIN something else END yet some more things." And this needle: "some", with the braces "BEGIN" and "END".

I want to find all needles that are not between braces. (There are two matches: the "some" followed by "other" and the "some" followed by "more".)

I figured I could solve this with a regex using negative lookahead/lookbehind, but how?

I have tried

(?<!(BEGIN))some(?!(END))

which gives me 4 matches (obviously because no "some" is directly enclosed between "BEGIN" and "END")

I also tried

(?<!(BEGIN.*))some(?!(.*END))

but this gives me no matches at all (obviously because each needle is somehow preceded by a "BEGIN")

Now I'm stuck.

Here's the latest C# code I used:

string input = "BEGIN something END some other thing BEGIN something else END yet some more things.";
global::System.Text.RegularExpressions.Regex re = new global::System.Text.RegularExpressions.Regex(@"(?<!(BEGIN.*))some(?!(.*END))");
global::System.Text.RegularExpressions.MatchCollection matches = re.Matches(input);
global::NUnit.Framework.Assert.AreEqual(2, matches.Count);
miasbeck

4 Answers

1

Would something like this work for you:

(?:^|END)((?!BEGIN).*?)(some)(.*?)(?:BEGIN|$)

This appears to match your text, as I tested using RegExDesigner.NET.
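As a rough sketch of how the expression above could be used from C#: the needle itself lands in the second capture group, so that group is read instead of the whole match. The snippet assumes C# 9 or later (top-level statements) and reuses the haystack from the question; the variable names are only illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "BEGIN something END some other thing BEGIN something else END yet some more things.";

// Group 1: text between the closing brace (or start of string) and the needle
// Group 2: the needle itself
// Group 3: text between the needle and the next opening brace (or end of string)
var re = new Regex(@"(?:^|END)((?!BEGIN).*?)(some)(.*?)(?:BEGIN|$)");

// Take the needle group from each match rather than the whole match.
Group[] needles = re.Matches(input)
                    .Cast<Match>()
                    .Select(m => m.Groups[2])
                    .ToArray();

foreach (Group g in needles)
    Console.WriteLine($"'{g.Value}' at index {g.Index}");
```

Each match consumes everything up to the next BEGIN, so this reports at most one needle per unbraced stretch of text.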

David Paxson
  • The above expression does it! Thanks a lot. I only had to get Group[2].Value instead of Group[0].Value, but that's fine. Thanks also for mentioning RegExDesigner. I hadn't heard about it before. – miasbeck Apr 08 '11 at 16:27
  • I think this expression doesn't work if you have multiple `some`s between the same `end` and `begin` - "some some END BEGIN some some" – Kobi Apr 08 '11 at 16:43
1

One simple option is to skip the parts you don't want to match, and capture only the needles you need:

MatchCollection matches = Regex.Matches(input, "BEGIN.*?END|(?<Needle>some)");

You'll get the two "some"s you're after by taking the successful "Needle" groups out of all matches:

IEnumerable<Group> needles = matches.Cast<Match>()
                                    .Select(m => m.Groups["Needle"])
                                    .Where(g => g.Success);
Kobi
  • +1, this is pretty clever. Have you tested it? I can tell your idea is that the alternation operator (`|`) would make anything matching `BEGIN.*?END` short-circuit itself out of the capture group, but I didn't think alternation was short-circuiting in regular expressions. – Justin Morgan - On strike Apr 08 '11 at 20:04
  • Update: It does work. http://rubular.com/r/6mKSumbyuF. I'm definitely going to remember this trick. – Justin Morgan - On strike Apr 08 '11 at 20:08
  • @Justin - Thanks! This isn't really about short-circuiting, it's about how the matching engine works - if it finds a match for a `start-end` block, it wouldn't search and capture `some`. I have some explanations of that [here](http://stackoverflow.com/questions/5153980/#5154081), [here](http://stackoverflow.com/questions/4383068/4384901#4384901) and [here](http://stackoverflow.com/questions/5283269/#5288185). – Kobi Apr 08 '11 at 21:54
0

You might try splitting the string on occurrences of BEGIN or END so that you can ensure that there is only one BEGIN and one END in the string that you apply your regex to. Also, if you are looking for occurrences of "some" that are outside your BEGIN/END braces, then I think you'd want to look behind for END and look ahead for BEGIN (positive lookahead/lookbehind), the opposite of what you have.
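One possible reading of the splitting idea, as a minimal sketch rather than a definitive implementation: split on either brace and keep only the pieces that fall outside. It assumes the braces are balanced and never nested, so that the pieces alternate outside, inside, outside, and so on. It also assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "BEGIN something END some other thing BEGIN something else END yet some more things.";

// Splitting on either brace yields alternating pieces: outside, inside, outside, ...
// Assumption: braces are balanced and never nested.
string[] pieces = Regex.Split(input, "BEGIN|END");

// Even-indexed pieces lie outside the braces; count needles only there.
int count = pieces.Where((piece, i) => i % 2 == 0)
                  .Sum(piece => Regex.Matches(piece, "some").Count);

Console.WriteLine(count);
```

This only counts the needles. If their positions in the original string are needed, the offsets of the pieces would have to be tracked as well.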

Hope this helps.

0

What if you just process the entire haystack and ignore the hay that is in between the braces (am I pushing the metaphor too far?)

For example, look through all the tokens (or characters, if you need to go to that level) and look for your braces. When the opening one is found, you loop through until you find the closing brace. At that point, you start looking for your needles until you find another opening brace. It's a bit more code than a regex, but might be more readable and easier to troubleshoot.
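The loop described above might look roughly like the following sketch. `FindOutside` is a hypothetical helper name, not part of any library. It assumes non-nested braces and treats an unclosed opening brace as hiding the rest of the string. It is written with top-level statements, so it needs C# 9 or later.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: start indices of every needle outside open...close regions.
static List<int> FindOutside(string haystack, string needle, string open, string close)
{
    if (needle.Length == 0) throw new ArgumentException("needle must not be empty");

    var hits = new List<int>();
    int pos = 0;
    while (pos < haystack.Length)
    {
        int nextOpen = haystack.IndexOf(open, pos, StringComparison.Ordinal);
        int end = nextOpen < 0 ? haystack.Length : nextOpen;

        // Collect needles that fit entirely before the next opening brace.
        int hit = haystack.IndexOf(needle, pos, StringComparison.Ordinal);
        while (hit >= 0 && hit + needle.Length <= end)
        {
            hits.Add(hit);
            hit = haystack.IndexOf(needle, hit + needle.Length, StringComparison.Ordinal);
        }

        if (nextOpen < 0) break;

        // Skip the braced region; an unclosed brace hides the rest of the string.
        int nextClose = haystack.IndexOf(close, nextOpen + open.Length, StringComparison.Ordinal);
        if (nextClose < 0) break;
        pos = nextClose + close.Length;
    }
    return hits;
}

string input = "BEGIN something END some other thing BEGIN something else END yet some more things.";
List<int> hits = FindOutside(input, "some", "BEGIN", "END");
Console.WriteLine(string.Join(", ", hits));
```

Unlike the regex variants, this returns the needle positions directly and makes the handling of malformed input (such as a missing closing brace) explicit.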

Ken Pespisa