I am trying to build a very simplified lexer using regex and named groups in C#.

I can get all the matched tokens along with their positions just fine, but I cannot find a way to also get the name of the group that matched each one.

I was planning to use that as the token type.

Here is a small example designed to lex simple SQL.

var matches = Regex.Matches("Select * from items where id > '10'", @"
(?:
(?<string>'[^']*')|
(?<number>\d+)|
(?<identifier>[a-zA-Z][a-zA-Z_0-9]+)|
(?:\s+)|
(?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)|
(?<other>.*)
)+
", RegexOptions.IgnorePatternWhitespace)
.Cast<Match>()
.SelectMany (m => m
    .Groups
    .Cast<Group>()
    .SelectMany (g => g
        .Captures
        .Cast<Capture>()
        .Select (c => new {c.Index, c.Length, c.Value})))
.Skip(1)
.Where (m => m.Length > 0)
.OrderBy (m => m.Index);

This returns a small result like this (index, length, value):

0 6 Select
7 1 *
9 4 from
14 5 items
20 5 where
26 2 id
29 1 >
31 4 '10'

But how can I get the capture group names into the table? Is it possible?

This is not a homework exercise or any type of school work; it's an experiment I am doing for a simple automation API for one of our products.

I can probably rewrite it using a more verbose solution, but I kind of like the "one-liner approach" of this one ;)

And if all else fails I already have a full lexer using real classes and much more advanced pattern matching, but that is not really required for this :D

UPDATE! I know which groups are available. What I would like to get is, for each capture in the result, which group caught it.

As the first comment points out, there is a method to get all group names from a regex, but then you have to fetch the results group by group. There does not seem to be a way to get the group from the capture.

David Mårtensson
  • Possible duplicate: http://stackoverflow.com/questions/1381097/regex-get-the-name-of-captured-groups-in-c-sharp – kaveman Nov 21 '13 at 17:01
  • what do you mean by "group name"? – Jodrell Nov 21 '13 at 17:06
  • the names of the capture groups, e.g. in `(?<string>'[^']*')|`, `string` is the name of the capture group – kaveman Nov 21 '13 at 17:07
  • @kaveman not really a duplicate. The question there is quite like mine, but the answer goes the other way by fetching group names first and using the names getting the captures. I hope there is a way to get the group from the actual capture, and if that is impossible, then my idea will need to be turned around. So I would suggest that a "Not possible" is actually a better answer to my question than the "GetGroupnames" method from that question. – David Mårtensson Nov 21 '13 at 19:34
  • I had the same problem a few years back. After spending a lot of time on it, I was convinced that you can only get the group from the group name and not the other way round. – Usman Zafar Nov 25 '13 at 10:24

1 Answer

[Appended a new solution I found following the link to the possible duplicate]

The answer to my question seems to be that it is not possible to get group names in any way except from the regex object.

I used part of the solution from the question linked in the first comment to work around this, but I would have liked to be able to go the more direct route.

Here is the solution I ended up with. (It uses LINQPad's Dump.)

var source = "select * from people where id > 10";

var re = new Regex(@"
    (?:
    (?<reserved>select|from|where|and|or|null|is|not)|
    (?<string>'[^']*')|
    (?<number>\d+)|
    (?<identifier>[a-z][a-z_0-9]+|\[[^\]]+\])|
    (?:\s+)|
    (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*|,|\.)|
    (?<other>.*)
    )+
    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Compiled);
    
// Match once and reuse the result instead of re-running the regex for every group name.
var match = re.Match(source);

(
    from name 
    in re.GetGroupNames() 
    select new {name = name, captures = match.Groups[name].Captures}
)
.Where (r => r.name != "0") // group "0" is the whole match, not a token type
.SelectMany (r => (
    from Capture c 
    in r.captures 
    where c.Length > 0
    select new {Type = r.name, Index = c.Index, Length = c.Length, Value = c.Value}
    )
).OrderBy (r => r.Index).ToList().Dump();

Based on a comment on the possible duplicate: from .NET Framework 4.7 onward, `Group` has a `Name` property, which did not exist when I wrote the above. So in case anyone stumbles upon this and is not discouraged enough, here is a version that does what I originally tried, though I no longer need it for anything :)

var matches = Regex.Matches("Select * from items where id > '10'", @"
(?:
(?<string>'[^']*')|
(?<number>\d+)|
(?<identifier>[a-zA-Z][a-zA-Z_0-9]+)|
(?:\s+)|
(?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)|
(?<other>.*)
)+
", RegexOptions.IgnorePatternWhitespace)
.Cast<Match>()
.SelectMany(m => m
   .Groups
   .Cast<Group>()
   .SelectMany(g => g
      .Captures
      .Cast<Capture>()
      .Select(c => new { c.Index, c.Length, c.Value, g.Name })))
.Where(m => m.Name != "0" && m.Length > 0) // drop group "0" (the whole match) and empty captures
.OrderBy(m => m.Index).Dump();
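
For what it's worth, here is a rough sketch of a variation I have not used in the product: drop the outer `(?: ... )+` wrapper so that every `Match` is exactly one token, then pick the one named group that succeeded. This is a different technique from the capture-collecting version above, and it still relies on `Group.Name` (.NET Framework 4.7 or later). `SimpleLexer` and `Tokenize` are hypothetical names made up for this example, and `other` is narrowed to a single non-whitespace character (an assumption on my part) so that it cannot swallow the rest of the input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the original code.
public static class SimpleLexer
{
    // Same alternatives as above, without the outer (?: ... )+ wrapper and
    // without the whitespace alternative: whitespace matches nothing here,
    // so Regex.Matches simply skips over it.
    private static readonly Regex TokenPattern = new Regex(@"
        (?<string>'[^']*')|
        (?<number>\d+)|
        (?<identifier>[a-zA-Z][a-zA-Z_0-9]+)|
        (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)|
        (?<other>\S)
        ", RegexOptions.IgnorePatternWhitespace);

    public static List<(string Type, int Index, int Length, string Value)> Tokenize(string source)
    {
        return TokenPattern.Matches(source)
            .Cast<Match>()
            .Select(m =>
            {
                // Exactly one named alternative succeeds per match;
                // group "0" is the whole match, so it is excluded.
                var g = m.Groups.Cast<Group>().First(x => x.Success && x.Name != "0");
                return (Type: g.Name, Index: m.Index, Length: m.Length, Value: m.Value);
            })
            .ToList();
    }
}
```

Calling `SimpleLexer.Tokenize("Select * from items where id > '10'")` should give eight tokens in source order, starting with an `identifier` at index 0 (`Select`) and ending with a `string` at index 31 (`'10'`), so no `OrderBy` is needed.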
David Mårtensson