Regex - MARK to see which capture group was matched

Question

I'm writing a small tokenizer in C#.

In the PCRE Regex specification there's the neat MARK keyword:
https://pcre.org/current/doc/html/pcre2syntax.html#SEC23

This is how it works:

https://3v4l.org/ErCrp

<?php

$string = 'bar';
$matches = [];

preg_match('~(?|foo(*:1)
               |bar(*:2)
               |baz(*:3))~x', $string, $matches);

var_dump($matches);

//> array(2) { 
//>     [0]=> string(3) "bar" 
//>     ["MARK"]=> string(1) "2" 
//> }

As you can see, the MARK entry in the result array shows which branch of the regular expression actually matched. Unfortunately, MARK is not supported by .NET's Regex engine. This is what I'm doing right now:

var pattern = @"(
    (?<foo>foo)
    |(?<bar>bar)
    |(?<baz>baz)
)";

var regexOptions = RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace;
var regex = new Regex(pattern, regexOptions);
var matches = regex.Matches("bar");

foreach (Match match in matches)
{
    int? mark = null;

    if (match.Groups["foo"].Success)
    {
        mark = 1;
    }
    else if (match.Groups["bar"].Success)
    {
        mark = 2;
    }
    else if (match.Groups["baz"].Success)
    {
        mark = 3;
    }
}

Basically, I need to reconstruct the entire regular expression to see which capture group was actually matched.

This seems backwards. Is there a better way I can do the same thing?

I need this because a tokenizer has to know not only whether the syntax is valid but also which token type each matched token has.

Put the possible tokens in a dictionary and loop over them. Then try to match right at the start of the string and see which one was matched - here, grab the token name, consume the string and break the loop. — Jan, Mar 13 '18 at 13:07
If you need to work out which token was matched, why not just read the value of the outer group? — Rawling, Mar 13 '18 at 13:40
@Rawling I don't completely understand your comment. Of course I have the string value of the matched tokens but I'd still need to figure out what type that token has. Let's say I'm matching 20 different keywords, I don't want to do 20 if statements for that. — IluTov, Mar 13 '18 at 15:08
Ah yes, I didn't read the entire question. So your problem is with the multiple if statements. With `MARK` you have if statements too. — revo, Mar 13 '18 at 15:12
@revo No, with MARK you have no if statements because I could pass the correct token type (which is an enum) to the MARK value. This way I'd simply extract the MARK value. — IluTov, Mar 13 '18 at 15:28
Could you show me what you mean by *I could pass the correct token type (which is an enum) to the MARK value*? I didn't get it. — revo, Mar 13 '18 at 15:30
If you have an enum you can just use `Enum.Parse` to turn the string you have into the enum value. — Rawling, Mar 13 '18 at 15:34
@Rawling A string literal token such as `"Foo"` is not a single value. There can be an unlimited number of string literals, so `Enum.Parse` cannot help me here. — IluTov, Mar 13 '18 at 15:47
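
The loop suggested in the first comment could look roughly like the sketch below. The `TokenType` enum, the `Rules` table and the `Tokenizer` class are hypothetical names chosen for illustration. An ordered array stands in for the dictionary because rule order matters when patterns overlap, and `\G` anchors each pattern at the position passed to `Regex.Match(input, startat)`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical token types, for illustration only.
enum TokenType { Foo, Bar, Baz }

static class Tokenizer
{
    // Each rule pairs a token type with a pattern anchored by \G,
    // so it can only match exactly at the current position.
    static readonly (TokenType Type, Regex Pattern)[] Rules =
    {
        (TokenType.Foo, new Regex(@"\Gfoo")),
        (TokenType.Bar, new Regex(@"\Gbar")),
        (TokenType.Baz, new Regex(@"\Gbaz")),
    };

    public static IEnumerable<(TokenType Type, string Text)> Tokenize(string input)
    {
        int position = 0;
        while (position < input.Length)
        {
            bool matched = false;
            foreach (var (type, pattern) in Rules)
            {
                Match m = pattern.Match(input, position);
                if (m.Success && m.Length > 0)
                {
                    // The rule that matched tells us the token type directly.
                    yield return (type, m.Value);
                    position += m.Length; // consume the matched text
                    matched = true;
                    break;
                }
            }
            if (!matched)
                throw new FormatException($"Unexpected input at position {position}.");
        }
    }
}
```

Here the token type comes from the rule that matched, so no chain of if statements is needed. The trade-off is one match attempt per rule at each position instead of a single combined pattern.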

score 1 · Answer 1 · answered Mar 14 '18 at 04:24

If you insist on using the MARK information, or want to use PCRE regexes from .NET in general, take a look at PCRE.NET, a .NET wrapper (available via NuGet) for the PCRE library. It exposes many of PCRE's features to .NET, including Mark retrieval.

Here is a short example:

using PCRE;
using System.Linq;
namespace PCREdNET
{
    class Program
    {
        static void Main(string[] args)
        {
            var marks = PcreRegex.Matches("bar", "(?|foo(*:1)|bar(*:2)|baz(*:3))")
                       .Select(m => m.Mark)
                       .ToList();
        }
    }
}
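
If an extra dependency is not wanted, a rough approximation is possible with the built-in `Regex` class. The sketch below assumes the named groups are spelled exactly like the members of a hypothetical `TokenType` enum. It enumerates the group names once via `Regex.GetGroupNames()` and maps the first successful group to the enum with `Enum.TryParse`, so the branches are not repeated in code. The class and method names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical token types; member names must equal the group names below.
enum TokenType { Foo, Bar, Baz }

static class MarkLikeLookup
{
    static readonly Regex TokenRegex = new Regex(
        @"(?<Foo>foo) | (?<Bar>bar) | (?<Baz>baz)",
        RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);

    // Group "0" is the whole match, so it is skipped.
    static readonly string[] GroupNames =
        TokenRegex.GetGroupNames().Where(n => n != "0").ToArray();

    // Returns the token type of the branch that matched, or null if none did.
    public static TokenType? Classify(Match match)
    {
        foreach (string name in GroupNames)
        {
            if (match.Groups[name].Success && Enum.TryParse(name, out TokenType type))
                return type;
        }
        return null;
    }

    static void Main()
    {
        foreach (Match m in TokenRegex.Matches("bar foo"))
            Console.WriteLine($"{m.Value} -> {Classify(m)}");
        // prints "bar -> Bar" and then "foo -> Foo"
    }
}
```

This still tests the groups one by one at run time, but adding a token only requires a new enum member and a new named branch in the pattern.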

answered Mar 14 '18 at 04:24 by wp78de

Some context on C#'s PCRE-compatibility: https://stackoverflow.com/a/26504537/8291949 – wp78de Mar 14 '18 at 04:27
Well, I'm not insisting on using `MARK`, I'm just wondering if there's a similar solution that does not require reconstructing the branches of the regular expression itself. The library looks interesting! – IluTov Mar 14 '18 at 08:04
