8

I'm writing a function-parsing engine which uses regular expressions to separate the individual terms (defined as a constant or a variable followed (optionally) by an operator). It's working great, except when I have grouped terms within other grouped terms. Here's the code I'm using:

//This matches an opening delimiter
Regex openers = new Regex("[\\[\\{\\(]");

//This matches a closing delimiter
Regex closers = new Regex("[\\]\\}\\)]");

//This matches the name of a variable (\w+) or a constant numeric value (\d+(\.\d+)?)
Regex VariableOrConstant = new Regex("((\\d+(\\.\\d+)?)|\\w+)" + FunctionTerm.opRegex + "?");

//This matches the binary operators +, *, -, or /
Regex ops = new Regex("[\\*\\+\\-/]");

//This compound Regex finds a single variable or constant term (including a proceeding operator,
//if any) OR a group containing multiple terms (and their proceeding operators, if any)
//and a proceeding operator, if any.
//Matches that match this second pattern need to be added to the function as sub-functions,
//not as individual terms, to ensure the correct evalutation order with parentheses.
Regex splitter = new Regex(
openers + 
"(" + VariableOrConstant + ")+" + closers + ops + "?" +
"|" +
"(" + VariableOrConstant + ")" + ops + "?");

When "splitter" is matched against the string "4/(2*X*[2+1])", the matches' values are "4/", "2*", "X*", "2+", and "1", completely ignoring all of the delimiting parentheses and braces. I believe this is because the second half of the "splitter" Regex (the part after the "|") is being matched and overriding the other part of the pattern. This is bad- I want grouped expressions to take precedence over single terms. Does anyone know how I can do this? I looked into using positive/negative lookaheads and lookbehinds, but I'm honestly not sure how to use those, or what they're even for, for that matter, and I can't find any relevant examples... Thanks in advance.

Michael Hoffmann
  • 2,411
  • 2
  • 24
  • 42
  • 1
    Have you looked into using something like Antlr? That's something more specialized for parsing / lexing. –  Dec 13 '10 at 01:55
  • What is FunctionTerm.opRegex? – Ryan Pedersen Dec 13 '10 at 02:01
  • FunctionTerm.opRegex is the same as "ops"- it's used by the Parse method in my FunctionTerm class. I thought I replaced that with "ops" to simplify the code, but I guess I missed that one- sorry. – Michael Hoffmann Dec 13 '10 at 02:35
  • 1
    If you parse "4+-1", what do you get? I think you'll get "4+" and "1" rather than "4+" and "-1" because your splitter is stealing the "-" and discarding it in the second part. I think you need to remove ` + ops + "?"`. – AlastairG Dec 13 '10 at 12:15
  • What language is this? I believe Perl regular expressions has a "matching brace" expression that would allow you to extract a bracketed section as you seem to want. – AlastairG Dec 13 '10 at 12:17
  • 2
    IMO, this is not a job for regex but for a parser (either *hand-rolled* , or generated with a parser generator like ANTLR, as @yodaj007 suggested). [Here](http://stackoverflow.com/questions/4396080/antlr-3-3-c-tutorials/4397799#4397799)'s a little example of how to use ANTLR with C#. – Bart Kiers Dec 13 '10 at 14:59
  • This question has been answered (and answered well- thanks :) ), but I just thought I'd clarify and comment for posterity. The language is C#, and AlastairG was right- attempting to parse "4+-1" yielded "4+" and "-1", which broke everything even more :P Thanks for pointing that out. – Michael Hoffmann Dec 23 '10 at 22:12

2 Answers2

5

You didn't show us how you're applying the regex, so here's a demo I whipped up:

private static void ParseIt(string subject)
{
  Console.WriteLine("subject : {0}\n", subject);

  Regex openers = new Regex(@"[[{(]");
  Regex closers = new Regex(@"[]})]");
  Regex ops = new Regex(@"[*+/-]");
  Regex VariableOrConstant = new Regex(@"((\d+(\.\d+)?)|\w+)" + ops + "?");

  Regex splitter = new Regex(
    openers + @"(?<FIRST>" + VariableOrConstant + @")+" + closers + ops + @"?" +
    @"|" +
    @"(?<SECOND>" + VariableOrConstant + @")" + ops + @"?",
    RegexOptions.ExplicitCapture
  );

  foreach (Match m in splitter.Matches(subject))
  {
    foreach (string s in splitter.GetGroupNames())
    {
      Console.WriteLine("group {0,-8}: {1}", s, m.Groups[s]);
    }
    Console.WriteLine();
  }
}

output:

subject : 4/(2*X*[2+1])

group 0       : 4/
group FIRST   :
group SECOND  : 4/

group 0       : 2*
group FIRST   :
group SECOND  : 2*

group 0       : X*
group FIRST   :
group SECOND  : X*

group 0       : [2+1]
group FIRST   : 1
group SECOND  :

As you can see, the term [2+1] is matched by the first part of the regex, as you intended. It can't do anything with the (, though, because the next bracketing character after that is another "opener" ([), and it's looking for a "closer".

You could use .NET's "balanced matching" feature to allow for grouped terms enclosed in other groups, but it's not worth the effort. Regexes are not designed for parsing--in fact, parsing and regex matching are fundamentally different kinds of operation. And this is a good example of the difference: a regex actively seeks out matches, skipping over anything it can't use (like the open-parenthesis in your example), but a parser has to examine every character (even if it's just to decide to ignore it).

About the demo: I tried to make the minimum functional changes necessary to get your code to work (which is why I didn't correct the error of putting the + outside the capturing group), but I also made several surface changes, and those represent active recommendations. To wit:

  • Always use verbatim string literals (@"...") when creating regexes in C# (I think the reason is obvious).
  • If you're using capturing groups, use named groups whenever possible, but don't use named groups and numbered groups in the same regex. Named groups save you the hassle of keeping track of what's captured where, and the ExplicitCapture option saves you having to clutter up the regex with (?:...) wherever you need a non-capturing group.

Finally, that whole scheme of building a large regex from a bunch of smaller regexes has very limited usefulness IMO. It's very difficult to keep track of the interactions between the parts, like which part's inside which group. Another advantage of C#'s verbatim strings is that they're multiline, so you can take advantage of free-spacing mode (a.k.a. /x or COMMENTS mode):

  Regex r = new Regex(@"
    (?<GROUPED>
      [[{(]                  # opening bracket
      (                      # group containing:
        ((\d+(\.\d+)?)|\w+)     # number or variable
        [*+/-]?                 # and proceeding operator
      )+                     # ...one or more times
      []})]                  # closing bracket
      [*+/-]?                # and proceeding operator
    )
    |
    (?<UNGROUPED>
      ((\d+(\.\d+)?)|\w+)    # number or variable
      [*+/-]?                # and proceeding operator
    )
    ",
    RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace
  );

This is not intended as a solution to your problem; as I said, that's not a job for regexes. This is just a demonstration of some useful regex techniques.

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
  • Thanks for the example. This explains a lot. This is also the first time I've seen the "comments mode"- that's very useful. I'd forgotten about verbatim strings, too. It looks like I'll have to change my parsing methods to actually parse :P Thanks for the feedback and help. – Michael Hoffmann Dec 15 '10 at 17:22
3

try using difrent quantifiers

greedy:

*  +  ?

possessive:

*+ ++ ?+

lazy:

*? +? ??

Try reading this and this

also maybe non-capturing group:

(?:your expr here)

try try try! practice make perfect! :)

Community
  • 1
  • 1
marverix
  • 7,184
  • 6
  • 38
  • 50
  • 2
    I'm afraid this answer isn't very helpful - you've mentioned the basics of regular expression without even trying to solve the problem. Had you tried, by the way, you'd discover simple regular expressions cannot solve it properly, unless you have a reasonable limitation on depth of nested parentheses, or awesome extensions to the regex engine, like recursive matching. – Kobi Dec 13 '10 at 16:03