I'm writing a small tokenizer in C#.
In the PCRE regex syntax there's the neat MARK verb, written (*MARK:NAME) or (*:NAME) for short:
https://pcre.org/current/doc/html/pcre2syntax.html#SEC23
This is how it works:
<?php
$string = 'bar';
$matches = [];
preg_match('~(?|foo(*:1)
|bar(*:2)
|baz(*:3))~x', $string, $matches);
var_dump($matches);
//> array(2) {
//> [0]=> string(3) "bar"
//> ["MARK"]=> string(1) "2"
//> }
As you can see, the MARK entry in the result set tells you which branch of the regular expression actually matched. Unfortunately, MARK is not supported by .NET's Regex class. This is what I'm doing right now:
var pattern = @"(
(?<foo>foo)
|(?<bar>bar)
|(?<baz>baz)
)";
var regexOptions = RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace;
var regex = new Regex(pattern, regexOptions);
var matches = regex.Matches("bar");
foreach (Match match in matches)
{
    int? mark = null;
    if (match.Groups["foo"].Success)
    {
        mark = 1;
    }
    else if (match.Groups["bar"].Success)
    {
        mark = 2;
    }
    else if (match.Groups["baz"].Success)
    {
        mark = 3;
    }
}
Basically, I have to mirror every branch of the regular expression in code just to find out which capture group actually matched.
This seems backwards. Is there a better way I can do the same thing?
I need this because a tokenizer has to know not only whether the input is valid, but also which token type each matched token is.
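For reference, the most compact variant I can come up with is a lookup table from group name to token type, sketched below. The names Marks, TokenRegex and GetMark are placeholders I made up for this example, not anything from the framework. It still probes every named group for each match, so it only tidies up the if/else chain rather than giving me a real MARK equivalent.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class Program
{
    // Hypothetical table: named group -> token type id (same values as the MARKs in the PHP example).
    static readonly Dictionary<string, int> Marks = new Dictionary<string, int>
    {
        ["foo"] = 1,
        ["bar"] = 2,
        ["baz"] = 3,
    };

    // Same alternation as before; whitespace in the pattern is ignored
    // because of RegexOptions.IgnorePatternWhitespace.
    static readonly Regex TokenRegex = new Regex(
        @"(?<foo>foo) | (?<bar>bar) | (?<baz>baz)",
        RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace);

    // Returns the id of the named group that took part in the match,
    // or null if none of the known groups succeeded.
    static int? GetMark(Match match)
    {
        foreach (var entry in Marks)
        {
            if (match.Groups[entry.Key].Success)
            {
                return entry.Value;
            }
        }
        return null;
    }

    static void Main()
    {
        foreach (Match match in TokenRegex.Matches("bar foo baz"))
        {
            Console.WriteLine($"{match.Value} => {GetMark(match)}");
        }
    }
}
```

Adding a token type then only means adding one alternative to the pattern and one entry to the table, but the lookup is still linear in the number of token types.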