What is a better way for matching groups in regular expressions by group name

Question

I have read How do I get the name of captured groups in a C# Regex? and How do I access named capturing groups in a .NET Regex? to try to understand how to find the result of a matched group in regular expressions.

I've also read everything in the MSDN at http://msdn.microsoft.com/en-us/library/30wbz966.aspx

It seems strange to me is how C# (or .NET) seems to be the only implementation of regular expressions that makes you iterate groups to find which group matched (especially if you need the name), and also the fact that the name isn't stored with the group result. PHP and Python for example will give you the group name that matched as part of the RegEx match result.

I have to iterate the groups and check for a match, and I have to keep a list of my own group names cause the names aren't in the result.

Here is my code to demonstrate:

public class Tokenizer
{
    private Dictionary<string, string> tokens;

    private Regex re;

    public Tokenizer()
    {
        tokens = new Dictionary<string, string>();
        tokens["NUMBER"] = @"\d+(\.\d*)?";  // Integer or decimal number
        tokens["STRING"] = @""".*""";       // String
        tokens["COMMENT"] = @";.*";         // Comment
        tokens["COMMAND"] = @"[A-Za-z]+";   // Identifiers
        tokens["NEWLINE"] = @"\n";          // Line endings
        tokens["SKIP"] = @"[ \t]";          // Skip over spaces and tabs

        List<string> token_regex = new List<string>();
        foreach (KeyValuePair<string, string> pair in tokens)
        {
            token_regex.Add(String.Format("(?<{0}>{1})", pair.Key, pair.Value));
        }
        string tok_regex = String.Join("|", token_regex);

        re = new Regex(tok_regex);
    }

    public List<Token> parse(string pSource)
    {
        List<Token> tokens = new List<Token>();

        Match get_token = re.Match(pSource);
        while (get_token.Success)
        {
            foreach (string gname in this.tokens.Keys)
            {
                Group group = get_token.Groups[gname];
                if (group.Success)
                {
                    tokens.Add(new Token(gname, get_token.Groups[gname].Value));
                    break;
                }
            }

            get_token = get_token.NextMatch();
        }
        return tokens;
    }
}

In the line

foreach (string gname in this.tokens.Keys)

That should not be necessary but it is.

Is there anyway to find the matching group and it's name without having to iterate all the groups?

EDIT: To compare implementations. Here is the same code that I wrote for a Python implementation.

class xTokenizer(object):
    """
    xTokenizer converts a text source code file into a collection of xToken objects.
    """

    TOKENS = [
        ('NUMBER',  r'\d+(\.\d*)?'),    # Integer or decimal number
        ('STRING',  r'".*"'),           # String
        ('COMMENT', r';.*'),            # Comment
        ('VAR',     r':[A-Za-z]+'),     # Variables
        ('COMMAND', r'[A-Za-z]+'),      # Identifiers
        ('OP',      r'[+*\/\-]'),       # Arithmetic operators
        ('NEWLINE', r'\n'),             # Line endings
        ('SKIP',    r'[ \t]'),          # Skip over spaces and tabs
        ('SLIST',   r'\['),             # Start a list of commands
        ('ELIST',   r'\]'),             # End a list of commands
        ('SARRAY',  r'\{'),             # Start an array
        ('EARRAY',  r'\}'),             # End end an array
    ]

    def __init__(self,tokens=None):
        """
        Constructor
            Args:
                tokens - key/pair of regular expressions used to match tokens.
        """
        if tokens is None:
            tokens = self.TOKENS
        self.tokens = tokens
        self.tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in tokens)
        pass

    def parse(self,source):
        """
        Converts the source code into a list of xToken objects.
            Args:
                sources - The source code as a string.
            Returns:
                list of xToken objects.
        """
        get_token = re.compile(self.tok_regex).match
        line = 1
        pos = line_start = 0
        mo = get_token(source)
        result = []
        while mo is not None:
            typ = mo.lastgroup
            if typ == 'NEWLINE':
                line_start = pos
                line += 1
            elif typ != 'SKIP':
                val = mo.group(typ)
                result.append(xToken(typ, val, line, mo.start()-line_start))
            pos = mo.end()
            mo = get_token(source, pos)
        if pos != len(source):
            raise xParserError('Unexpected character %r on line %d' %(source[pos], line))
        return result

As you can see Python doesn't require you to iterate the groups, and a similar thing can be done in PHP and I assume Java.

score 1 · Answer 1 · answered Sep 15 '12 at 18:35

1

All your token types start with different characters. How about compiling a HashSet<char,string> that maps all possible start characters to the matching group name? That way you only have to examine the first character of the entire match to figure out which group was matched.

answered Sep 15 '12 at 18:35

Wormbo

4,978
2
21
41

True, this optimizes the search for matching groups, but doesn't answer the question why is this necessary. – Reactgular Sep 15 '12 at 18:41
Your question was "Is there anyway to find the matching group and it's name without having to iterate all the groups?" and that's what I answered. A possible answer to the "why" could be that the Regex class is not the best tool to build lexers with. There are better ways to do that with other tools. – Wormbo Sep 15 '12 at 19:16
Sorry, but I disagree because your recommendation was to use the first character in a match as an index into a HashSet. I don't think this is a good approach at all. – Reactgular Sep 15 '12 at 19:20
That's an O(1) approach without any iteration. What kind of approach did you have in mind? – Wormbo Sep 15 '12 at 19:23
I know that I'll be updating the regex and possible tokens in the future. So I can't assume the first character will be unique. – Reactgular Sep 15 '12 at 20:36
And that will be the point when a Regex is no longer a good base for your lexer. Internally it may do what any lexer has to do, but it does so without exposing necessary state information. – Wormbo Sep 16 '12 at 05:39

score 1 · Accepted Answer · answered Sep 15 '12 at 18:52

1

There's no need to maintain a separate list of named groups. Use the Regex.GetGroupNames method instead.

Your code would then look similar to this:

foreach (string gname in re.GetGroupNames())
{
    Group group = get_token.Groups[gname];
    if (group.Success)
    {
        // your code
    }
}

That said, be aware of this note on the MSDN page:

Even if capturing groups are not explicitly named, they are automatically assigned numerical names (1, 2, 3, and so on).

With that in mind, you should either name all your groups, or filter out numeric group names. You could do so with some LINQ, or with an additional check that !Char.IsNumber(gname[0]) to check the first character of the group name, making the assumption that any such group is invalid. Alternately, you could also use the int.TryParse method.

answered Sep 15 '12 at 18:52

Ahmad Mageed

94,561
19
163
174

True, except re.GetGroupNames() contains two extra groups. The first group named "0" will be marked as (Success == true) and is a copy of the matching group, and group named "1" is the entire result with (Success == false). Why is this there? I have no idea. Another issue with .Net I assume. – Reactgular Sep 15 '12 at 18:54
+1 for explaining numerical names. I could not find any reference to that in the MSDN. thanks. Do you have a link to where that is stated? – Reactgular Sep 15 '12 at 18:56
Never mind, it's located here. http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.getgroupnames.aspx – Reactgular Sep 15 '12 at 18:57
Has to be the correct answer. You can't access groups by name in .NET, but if you use numeric names then resulting groups in the match are numbered the same. – Reactgular Sep 18 '12 at 21:50

What is a better way for matching groups in regular expressions by group name

2 Answers2