
I'm validating a string that must consist of a prefix that matches a pattern (call it pat1), followed by 0 or more repeating patterns, each containing 3 sub-patterns (pat2, pat3, pat4) that I'd like to capture individually. Simplified a bit, the regex basically looks like this:

^pat1 ((pat2) (pat3) (pat4))*$

The ^ and $ anchors are present because the match should fail if the string contains any extraneous characters at the beginning or end.

Using Regex.Match against a valid expression, I get a Match object with the following characteristics:

  • 5 Groups, consisting of the entire expression, the last matched group (i.e. last occurrence of pat2 pat3 pat4), and matches for the last pat2, pat3, and pat4 individually. That's all well documented and exactly as expected.
  • Groups[1].Captures containing a Capture for each occurrence of pat2 pat3 pat4. Also as expected.
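
To make the shape of that Match object concrete, here is a minimal sketch. The pattern and input are placeholders taken from the regex tester link in the comments, not the real pat1..pat4, so treat the literal values as assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Placeholder pattern and input (assumed, from the tester link in the comments).
        var regex = new Regex(@"^pat1( (pat\d) (pat\d) (pat\d))*$");
        Match m = regex.Match("pat1 pat2 pat3 pat4 pat5 pat6 pat7");

        Console.WriteLine(m.Success);          // True
        Console.WriteLine(m.Groups.Count);     // 5 (group 0 is the whole match)

        // A Group's Value is its LAST capture only.
        Console.WriteLine($"[{m.Groups[1].Value}]");   // [ pat5 pat6 pat7]

        // Captures holds one entry per iteration of the repeated group.
        foreach (Capture c in m.Groups[1].Captures)
            Console.WriteLine($"[{c.Value}]");         // [ pat2 pat3 pat4], then [ pat5 pat6 pat7]
    }
}
```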

What I can't figure out how to do is extract pat2, pat3, and pat4 individually from each Capture. Is this even possible with a single execution of a single Regex? I'm starting to think not. (If Capture had a Groups property, that's probably exactly what I'd need. But it doesn't.)

If I take a completely different approach using Regex.Matches instead of Regex.Match, I think I could capture all the individual bits I need, but it wouldn't validate that the string cannot contain extraneous characters before, after, or between the matches.

So what I've resorted to for now is executing the original expression, iterating through Groups[1].Captures and executing a Regex again (this time just (pat2) (pat3) (pat4)) just to parse out the 3 groups within each capture. This feels like I'm making the engine repeat work that's already been done, but maybe it's the only way?
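
A minimal sketch of that two-pass workaround, for reference. The patterns and input are placeholders (assumed, matching the tester link in the comments), and the names `outer` and `inner` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Pass 1: validate the whole string, anchors included.
        var outer = new Regex(@"^pat1( (pat\d) (pat\d) (pat\d))*$");
        // Pass 2: re-parse each repetition to split out its three parts.
        var inner = new Regex(@"(pat\d) (pat\d) (pat\d)");

        Match m = outer.Match("pat1 pat2 pat3 pat4 pat5 pat6 pat7");
        if (!m.Success)
        {
            Console.WriteLine("invalid input");
            return;
        }

        foreach (Capture c in m.Groups[1].Captures)
        {
            // The engine repeats work here that pass 1 already did.
            Match part = inner.Match(c.Value);
            Console.WriteLine(
                $"{part.Groups[1].Value} | {part.Groups[2].Value} | {part.Groups[3].Value}");
        }
        // Prints:
        // pat2 | pat3 | pat4
        // pat5 | pat6 | pat7
    }
}
```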

As a side-note, I did a lot of searching for "groups within captures", etc., and was very surprised to come up empty. I would think this is a common scenario, which makes me wonder if I should be taking a completely different approach. I'm open to any suggestion that meets all validation and capturing requirements in a reasonably efficient manner.

Todd Menier
  • Name the capture groups, then it's easy to extract them. –  Aug 29 '19 at 19:25
  • Can you produce a working example of the code? – Sach Aug 29 '19 at 19:31
  • Say, [this is your regex and a string](http://regexstorm.net/tester?p=%5epat1%28+%28pat%5cd%29+%28pat%5cd%29+%28pat%5cd%29%29*%24&i=pat1+pat2+pat3+pat4+pat5+pat6+pat7). What output do you need *exactly*? – Wiktor Stribiżew Aug 29 '19 at 19:59
  • Do you realize `pat2`, `pat3`, and `pat4` are in Group 2, Group 3 and Group 4? They are not in Group 1. You can't find any info about groups inside captures because it is vice versa. – Wiktor Stribiżew Aug 29 '19 at 20:06
  • @WiktorStribiżew In addition to confirming that the string matches the expression as a whole, I would want to capture `pat2`, `pat3`, `pat4`, `pat5`, `pat6`, and `pat7` as 6 individual strings. Thanks! – Todd Menier Aug 29 '19 at 20:08
  • So, you already do it. Study the linked post, it is all explained there. – Wiktor Stribiżew Aug 29 '19 at 20:08
  • Wow...I wonder if I missed something obvious here... – Todd Menier Aug 29 '19 at 20:09
  • @WiktorStribiżew I don't know if I'd call it "obvious" but I did miss it. The first set of 3 captured values I want are at `Groups[2].Captures[0]`, `Groups[3].Captures[0]`, `Groups[4].Captures[0]`, next are at `Groups[2].Captures[1]`, `Groups[3].Captures[1]`, `Groups[4].Captures[1]`, and so on. Intuitively the nesting seems backwards, but it is indeed all there. Thanks for the help. – Todd Menier Aug 29 '19 at 20:27
  • Why "backwards"? It is vice versa, numbered capturing groups are numbered from left to right. Are you a native speaker of an RTL writing language? – Wiktor Stribiżew Aug 29 '19 at 20:29
  • By "backwards" I don't mean RTL, I mean the Groups/Captures hierarchy feels inverted in this case. Ignoring the domain of regexes for a moment, we basically have a "collection of collections". I would expect to find my first group of 3 things at `outer[0].inner[0, 1, 2]`. Instead, they are at `outer[0, 1, 2].inner[0]`. Not suggesting it's wrong, just explaining why at a glance it feels unintuitive. Hope that makes sense. – Todd Menier Aug 29 '19 at 20:44
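
The resolution reached in the comments can be sketched in a single pass as follows. The pattern and input are again the placeholders from the tester link (assumed, not the real patterns). Indexing the three Captures collections in parallel relies on each inner group capturing exactly once per repetition; if one of them were optional, the collections could differ in length and the indices would no longer line up.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Placeholder pattern and input (assumed).
        var regex = new Regex(@"^pat1( (pat\d) (pat\d) (pat\d))*$");
        Match m = regex.Match("pat1 pat2 pat3 pat4 pat5 pat6 pat7");
        if (!m.Success)
        {
            Console.WriteLine("invalid input");
            return;
        }

        // Groups[1] has one capture per repetition; Groups[2..4] each have
        // one capture per repetition too, in the same order.
        int repetitions = m.Groups[1].Captures.Count;
        for (int i = 0; i < repetitions; i++)
        {
            string first  = m.Groups[2].Captures[i].Value;
            string second = m.Groups[3].Captures[i].Value;
            string third  = m.Groups[4].Captures[i].Value;
            Console.WriteLine($"{first} | {second} | {third}");
        }
        // Prints:
        // pat2 | pat3 | pat4
        // pat5 | pat6 | pat7
    }
}
```

This keeps the anchored validation and the per-part extraction in one execution of one Regex, at the cost of the "outer[0, 1, 2].inner[i]" indexing described in the last comment.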
