0

I'm new to regex and was hoping for a pointer towards matching words that appear between { } brackets, where the first letter is uppercase and the second is lowercase. I want to ignore numbers, and also any words that contain numbers.

{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}

so I would only want to bring back matches for:

Test
Tesgd
Abc

I've looked at using \b and \w for word boundaries and [Az] for an uppercase letter followed by lowercase, but I'm not sure how to also restrict the matches to words that are between the brackets.

Toto
user1186144
  • Is it possible to have nested { } brackets? Example: { {aa, bb} cc } , dd – Ali Ferhat Feb 03 '12 at 00:01
  • "the second is lowercase": is there always a second letter? Can the third letter be uppercase again, or is all the rest lowercase? – Ali Ferhat Feb 03 '12 at 00:02
  • It is possible to have nested brackets, yeah. Sorry, I should have said that all the rest should be lowercase after the first uppercase letter. – user1186144 Feb 03 '12 at 00:04
  • Is there (a) always (b) sometimes (c) never a space after an opening bracket? – Ali Ferhat Feb 03 '12 at 00:08
  • While it may be possible to do this with Regexes, why do you want to stick with them? The code will be harder to write, probably run slower than regular parsing, and much harder to debug or change when you come back to the code in the future. I would tokenize the string, parse through keeping track of bracket depth, and maybe use regexes to test individual words. – TheEvilPenguin Feb 03 '12 at 00:10
  • Here is something to read for nested braces - http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns . – Alexei Levenkov Feb 03 '12 at 00:18
  • No no no no no no no don't use regex to match nested patterns. No no no no no! http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns – Robert P Feb 03 '12 at 20:50

3 Answers

3

Here is your solution:

Regex r = new Regex(@"(?<={[^}]*?({(?<depth>)[^}]*?}(?<-depth>))*?[^}]*?)(?<myword>[A-Z][a-z]+?)(?=,|}|\Z)", RegexOptions.ExplicitCapture);
string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
var m = r.Matches(s);
foreach (Match match in m)
   Console.WriteLine(match.Groups["myword"].Value);

I assumed it is OK to match words that are inside braces but not at the deepest nesting level. Let's dissect the regex a bit. In the list below, AAA stands for an arbitrary expression and www for an arbitrary identifier (a sequence of letters).

  • . is any character
  • [A-Z] is as you can guess any upper case letter.
  • [^}] is any character but }
  • ,|}|\Z means , or } or end-of-string
  • *? means match what came before 0 or more times but lazily (Do a minimal match if possible and spit what you swallowed to make as many matches as possible)
  • (?<=AAA) means AAA should match on the left before you really try to match something.
  • (?=AAA) means AAA should match on the right after you really match something.
  • (?<www>AAA) means match AAA and give the string you matched the name www. With the ExplicitCapture option, only named groups like this one capture anything.
  • (?<depth>) matches the empty string (so it always succeeds) and pushes a capture onto the "depth" stack.
  • (?<-depth>) also matches the empty string, but pops a capture from the "depth" stack. It fails if the stack is empty.

We use the last two items to ensure that we are inside a pair of braces. It would be much simpler if there were no nested braces, or if matches occurred only at the deepest nesting level.
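To see the push and pop constructs in isolation, here is a minimal sketch that does nothing but check whether the braces in a string are balanced. The pattern and the names BalanceDemo and IsBalanced are made up for illustration and are not part of the solution above; the sketch also uses the .NET conditional construct (?(depth)(?!)), which fails the match if anything is still left on the "depth" stack.

```csharp
using System;
using System.Text.RegularExpressions;

static class BalanceDemo
{
    // Illustrative pattern (an assumption, not the answer's regex):
    //   [^{}]           consume any non-brace character
    //   \{(?<depth>)    consume '{' and push onto the "depth" stack
    //   \}(?<-depth>)   consume '}' and pop; fails if the stack is empty
    //   (?(depth)(?!))  at the end, fail if the stack is not empty
    static readonly Regex Balanced = new Regex(
        @"^(?:[^{}]|\{(?<depth>)|\}(?<-depth>))*(?(depth)(?!))$");

    public static bool IsBalanced(string s) => Balanced.IsMatch(s);

    static void Main()
    {
        Console.WriteLine(IsBalanced("{ a { b } c }")); // True
        Console.WriteLine(IsBalanced("{ a { b } c"));   // False: one '{' is never popped
        Console.WriteLine(IsBalanced("a } b"));         // False: pop on an empty stack
    }
}
```

The same push/pop bookkeeping is what lets a .NET regex decide how deep inside braces a given position is.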

The regular expression works on your example and probably has no bugs. However, I tend to agree with the others: you should not blindly copy what you cannot understand and maintain. Regular expressions are wonderful, but only if you are willing to spend the effort to learn them.

Edit: I corrected a careless mistake in the regex (replaced .*? with [^}]*? in two places). Moral of the story: it's very easy to introduce bugs in regexes.

Ali Ferhat
  • This doesn't work. In the OP's sample string the only words that aren't enclosed in braces are `test5` and `test6`, which also fail to meet the other criteria: they don't start with capitals and they do contain digits. Replace one of them with `Testx` and you'll see it gets flagged as a match even though it's not enclosed in braces. – Alan Moore Feb 03 '12 at 01:28
  • +1. Nice detailed explanation. Try to never do it in real life so :). – Alexei Levenkov Feb 03 '12 at 02:29
  • Voted up for managing to craft a regex to solve the problem and also providing a good explanation of how it works. If I ever saw this in code I had to maintain, though, I would be a sad developer. – TheEvilPenguin Feb 03 '12 at 03:17
  • Consider using the Extended Whitespace modifier, and putting your comments in-line: http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#Whitespace – Robert P Feb 03 '12 at 20:52
0

In answer to your original question, I would have offered this regex:

\b[A-Z][a-z]+\b(?=[^{}]*})

The last part is a positive lookahead; it notes the current match position, tries to match the enclosed subexpression, then returns the match position to where it started. In this case, it starts at the end of the word that was just matched and gobbles up as many characters as it can, as long as they're not { or }. If the next character after that is }, it means the word is inside a pair of braces, so the lookahead succeeds. If the next character is {, or if there's no next character because it's at the end of the string, the lookahead fails and the regex engine moves on to try the next word.
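As a quick check of that behaviour, here is a small sketch that runs the regex over the string from the question and over a second, made-up string with a capitalized word outside the braces. The class and method names (LookaheadDemo, Find) are only for illustration, and this version still assumes the braces are not nested.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Capitalized all-letter word, followed (after any non-brace characters) by '}'.
    static readonly Regex Word = new Regex(@"\b[A-Z][a-z]+\b(?=[^{}]*})");

    public static string[] Find(string s) =>
        Word.Matches(s).Cast<Match>().Select(m => m.Value).ToArray();

    static void Main()
    {
        string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
        Console.WriteLine(string.Join(", ", Find(s)));
        // prints: Test, Tesgd, Abc

        // "Testx" sits between two groups: the lookahead hits '{' first and fails.
        Console.WriteLine(string.Join(", ", Find("{Abc} Testx {Def}")));
        // prints: Abc, Def
    }
}
```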

Unfortunately, that won't work because (as you mentioned in a comment) the braces may be nested. Matching any kind of nested or recursive structure is fundamentally incompatible with the way regexes work. Many regex flavors offer that capability anyway, but they tend to go about it in wildly different ways, and it's always ugly. Here's how I would do this in C#, using Balanced Groups:

  Regex r = new Regex(@"
      \b[A-Z][a-z]+\b
      (?!
        (?>
          [^{}]+
          |
          { (?<Open>)
          |
          } (?<-Open>)
        )*
        $
        (?(Open)(?!))
      )", RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);
  string s = "testa Testb { Test1 Testc testd 1Test } Teste { Testf {testg Testh} testi } Testj";
  foreach (Match m in r.Matches(s))
  {
    Console.WriteLine(m.Value);
  }

output:

Testc
Testf
Testh

I'm still using a lookahead, but this time I'm using the group named Open as a counter to keep track of the number of opening braces relative to the number of closing braces. If the word currently under consideration is not enclosed in braces, then by the time the lookahead reaches the end of the string ($), the value of Open will be zero. Otherwise, whether it's positive or negative, the conditional construct - (?(Open)(?!)) - will interpret it as "true" and try to match (?!). That's a negative lookahead for nothing, which is guaranteed to fail; it's always possible to match nothing.

Nested or not, there's no need to use a lookbehind; a lookahead is sufficient. Most flavors place such severe restrictions on lookbehinds that nobody would even think to try using them for a job like this. .NET has no such restrictions, so you could do this in a lookbehind, but it wouldn't make much sense. Why do all that work when the other conditions--uppercase first letter, no digits, etc--are so much cheaper to test?

Alan Moore
-1

Do the filtering in two steps. Use the regular expression

@"\{(.*)\}"

to pull out the pieces between the brackets, and the regular expression

@"\b([A-Z][a-z]+)\b"

to pull out each of the words that begins with a capital letter and is followed by lower case letters.
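A minimal sketch of the two steps, using the two patterns exactly as written above; the class and method names (TwoStepDemo, Find) are hypothetical. One thing to keep in mind is that .* is greedy, so the first pattern captures everything from the first { to the last }, including any text that sits between separate groups of braces. The second call below shows that effect.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TwoStepDemo
{
    public static List<string> Find(string s)
    {
        var words = new List<string>();
        // Step 1: pull out the text between the brackets (greedy: first '{' to last '}').
        foreach (Match outer in Regex.Matches(s, @"\{(.*)\}"))
        {
            // Step 2: pull out capitalized all-letter words from that text.
            foreach (Match inner in Regex.Matches(outer.Groups[1].Value, @"\b([A-Z][a-z]+)\b"))
                words.Add(inner.Groups[1].Value);
        }
        return words;
    }

    static void Main()
    {
        string s = "{ test1, Test2, Test, 1213, Tsg12, Tesgd} , test5, test6, {abc, Abc}";
        Console.WriteLine(string.Join(", ", Find(s)));
        // prints: Test, Tesgd, Abc

        // Greedy capture also sweeps up the word between the two groups.
        Console.WriteLine(string.Join(", ", Find("{Abc} Testx {Def}")));
        // prints: Abc, Testx, Def
    }
}
```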

Adam Mihalcin