I'm writing a small tokenizer in C#.

In the PCRE Regex specification there's the neat MARK keyword:
https://pcre.org/current/doc/html/pcre2syntax.html#SEC23

This is how it works:

https://3v4l.org/ErCrp

<?php

$string = 'bar';
$matches = [];

preg_match('~(?|foo(*:1)
               |bar(*:2)
               |baz(*:3))~x', $string, $matches);

var_dump($matches);

//> array(2) { 
//>     [0]=> string(3) "bar" 
//>     ["MARK"]=> string(1) "2" 
//> } 

As you can see, the MARK entry in the result set tells you which branch of the regular expression actually matched. Unfortunately, MARK is not supported by .NET's Regex class. This is what I'm doing right now:

var pattern = @"(
    (?<foo>foo)
    |(?<bar>bar)
    |(?<baz>baz)
)";

var regexOptions = RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace;
var regex = new Regex(pattern, regexOptions);
var matches = regex.Matches("bar");

foreach (Match match in matches)
{
    int? mark = null;

    if (match.Groups["foo"].Success)
    {
        mark = 1;
    }
    else if (match.Groups["bar"].Success)
    {
        mark = 2;
    }
    else if (match.Groups["baz"].Success)
    {
        mark = 3;
    }
}

Basically, I need to reconstruct the entire regular expression to see which capture group was actually matched.

This seems backwards. Is there a better way I can do the same thing?

I need this because a tokenizer has to know not only whether the syntax is valid but also which token type the matched token is.

IluTov
  • What about named capturing groups? – revo Mar 13 '18 at 13:05
  • Put the possible tokens in a dictionary and loop over them. Then try to match right at the start of the string and see which one was matched - here, grab the token name, consume the string and break the loop. – Jan Mar 13 '18 at 13:07
  • If you need to work out which token was matched, why not just read the value of the outer group? – Rawling Mar 13 '18 at 13:40
  • @revo My current solution does use named capture groups ;) – IluTov Mar 13 '18 at 15:06
  • @Rawling I don't completely understand your comment. Of course I have the string value of the matched tokens but I'd still need to figure out what type that token has. Let's say I'm matching 20 different keywords, I don't want to do 20 if statements for that. – IluTov Mar 13 '18 at 15:08
  • Aah yes I didn't go through entire question. So your problem is with multiple if statements. With `MARK` you have if statements too. – revo Mar 13 '18 at 15:12
  • @revo No, with MARK you have no if statements because I could pass the correct token type (which is an enum) to the MARK value. This way I'd simply extract the MARK value. – IluTov Mar 13 '18 at 15:28
  • May you show me this *I could pass the correct token type (which is an enum) to the MARK value*? I didn't get it. – revo Mar 13 '18 at 15:30
  • If you have an enum you can just use `Enum.Parse` to turn the string you have into the enum value. – Rawling Mar 13 '18 at 15:34
  • @Rawling A string literal token such as `"Foo"` is not a single value. There can be an unlimited number of string literals. This is nothing `Enum.Parse` can do for me. – IluTov Mar 13 '18 at 15:47
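
Jan's suggestion in the comments above (keep the token patterns in a table, try each one at the current position, take the first that matches, consume it, and continue) could be sketched roughly as follows. This is only an illustration under assumptions: the token names, the patterns, and the `Tokenize` helper are hypothetical, and an ordered list is used instead of a dictionary so that "first match wins" is well defined. It is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical token table; names and patterns are illustrative only.
// \G anchors each attempt at the start position passed to Regex.Match.
var tokenPatterns = new List<(string Kind, Regex Pattern)>
{
    ("Foo", new Regex(@"\Gfoo")),
    ("Bar", new Regex(@"\Gbar")),
    ("Baz", new Regex(@"\Gbaz")),
    ("Whitespace", new Regex(@"\G\s+")),
};

// Walks the input from left to right and records (kind, text) for each token.
List<(string Kind, string Text)> Tokenize(string input)
{
    var tokens = new List<(string Kind, string Text)>();
    var position = 0;

    while (position < input.Length)
    {
        var matched = false;

        foreach (var (kind, pattern) in tokenPatterns)
        {
            var match = pattern.Match(input, position);
            // Skip failed and empty matches so the loop always makes progress.
            if (!match.Success || match.Length == 0) continue;

            tokens.Add((kind, match.Value));
            position += match.Length;
            matched = true;
            break; // first pattern in table order wins
        }

        if (!matched)
            throw new FormatException($"Unexpected input at position {position}.");
    }

    return tokens;
}

foreach (var token in Tokenize("foo bar"))
    Console.WriteLine($"{token.Kind}: '{token.Text}'");
```

The token kind comes straight from the table entry whose pattern matched, so no chain of if statements is needed; the cost is one match attempt per table entry at each position.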

1 Answer

If you insist on using the MARK info, or want to use PCRE regexes from .NET in general, take a look at PCRE.NET, a .NET wrapper for the PCRE library (available via NuGet). It exposes many of PCRE's features to .NET, including mark retrieval.

Here is a short example:

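using PCRE;
using System.Linq;

namespace PCREdNET
{
    class Program
    {
        static void Main(string[] args)
        {
            // Each alternative sets its own mark via (*:n);
            // PCRE.NET exposes the mark of a match through the Mark property.
            var marks = PcreRegex.Matches("bar", "(?|foo(*:1)|bar(*:2)|baz(*:3))")
                       .Select(m => m.Mark)
                       .ToList();
        }
    }
}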
wp78de
  • Some context on C#'s PCRE-compatibility: https://stackoverflow.com/a/26504537/8291949 – wp78de Mar 14 '18 at 04:27
  • Well, I'm not insisting on using `MARK`, I'm just wondering if there's a similar solution that does not require reconstructing the branches of the regular expression itself. The library looks interesting! – IluTov Mar 14 '18 at 08:04
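
For a variant that stays within `System.Text.RegularExpressions`, one possible way to avoid repeating the branches by hand is to generate both the pattern and the mark lookup from a single table. The sketch below is an assumption-laden illustration, not an equivalent of MARK: the table contents and the `FindMark` helper are hypothetical, and the lookup still scans the named groups, it just does so in a loop driven by the table rather than in hand-written if statements. It is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical token table: group name, mark value, and sub-pattern per entry.
var tokenTable = new (string Name, int Mark, string Pattern)[]
{
    ("foo", 1, "foo"),
    ("bar", 2, "bar"),
    ("baz", 3, "baz"),
};

// Build one alternation with a named group per table entry:
// (?<foo>foo)|(?<bar>bar)|(?<baz>baz)
var combined = new Regex(
    string.Join("|", tokenTable.Select(t => $"(?<{t.Name}>{t.Pattern})")),
    RegexOptions.ExplicitCapture);

// Returns the mark of the first named group that took part in the match,
// or null if none did (for example, when the match failed).
int? FindMark(Match candidate)
{
    foreach (var entry in tokenTable)
    {
        if (candidate.Groups[entry.Name].Success)
            return entry.Mark;
    }
    return null;
}

foreach (Match m in combined.Matches("bar foo"))
    Console.WriteLine($"{m.Value} -> {FindMark(m)}");
```

Adding a token type then means adding one table row; the pattern and the lookup follow from it. The mark column could just as well hold an enum value instead of an `int`.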