8

.NET offers a Capture collection in its RegularExpression implementation so you can get all instances of a given repeating group rather than just the last instance of it. That's great, but I have a repeating group with subgroups and I'm trying to get at the subgroups as they are related under the group, and can't find a way. Any suggestions?

I've looked at a number of other questions, but I have found no applicable answer, either affirmative ("Yep, here's how") or negative ("Nope, can't be done").

For a contrived example say I have an input string:

abc d x 1 2 x 3 x 5 6 e fgh

where the "abc" and "fgh" represent text that I want to ignore in the larger document, "d" and "e" wrap the area of interest, and within that area of interest, "x n [n]" can repeat any number of times. It's those number pairs in the "x" areas that I'm interested in.

So I'm parsing it using this regular expression pattern:

.*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*

which will find exactly one match in the document, but capture the "x" group many times. Here are the three pairs I would want to extract in this example:

  • 1, 2
  • 3
  • 5, 6

but how can I get them? I could do the following (in C#):

using System;
using System.Text.RegularExpressions;
using System.Windows.Forms; // for MessageBox

string input = "abc d x 1 2 x 3 x 5 6 e fgh";
string pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
// Each Capture is one iteration of the repeating "x" group.
foreach (Capture x in Regex.Match(input, pattern).Groups["x"].Captures) {
    MessageBox.Show(x.Value);
}

and since I'm referencing group "x" I get these strings:

  • x 1 2
  • x 3
  • x 5 6

But that doesn't get me at the numbers themselves. So I could do "fir" and "sec" independently instead of just "x":

using System;
using System.Text.RegularExpressions;
using System.Windows.Forms; // for MessageBox

string input = "abc d x 1 2 x 3 x 5 6 e fgh";
string pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
Match m = Regex.Match(input, pattern);
// All "fir" captures across every iteration of "x", in match order.
foreach (Capture f in m.Groups["fir"].Captures) {
    MessageBox.Show(f.Value);
}

// All "sec" captures; iterations without a second number add nothing here.
foreach (Capture s in m.Groups["sec"].Captures) {
    MessageBox.Show(s.Value);
}

to get:

  • 1
  • 3
  • 5
  • 2
  • 6

but then I have no way of knowing that it's the second pair that's missing the "4", and not one of the other pairs.

So what to do? I know I could easily parse this out in C# or even with a second regex test on the "x" group, but since the first RegEx run has already done all the work and the results ARE known, it seems there ought to be a way to manipulate the Match object to get what I need out of it.

And remember, this is a contrived example, the real world case is somewhat more complex so just throwing extra C# code at it would be a pain. But if the existing .NET objects can't do it, then I just need to know that and I'll continue on my way.

Thoughts?

bob
  • 452
  • 4
  • 11

4 Answers

5

I am not aware of a fully built-in solution and could not find one after a quick search, but that does not rule out the possibility that one exists.

My best suggestion is to use the Index and Length properties to find matching captures. It is not very elegant, but you might be able to come up with some quite nice code after writing a few extension methods.

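using System;
using System.Linq; // Cast, Select, FirstOrDefault
using System.Text.RegularExpressions;

var input = "abc d x 1 2 x 3 x 5 6 e fgh";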

var pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";

var match = Regex.Match(input, pattern);

var xs = match.Groups["x"].Captures.Cast<Capture>();

var firs = match.Groups["fir"].Captures.Cast<Capture>();
var secs = match.Groups["sec"].Captures.Cast<Capture>();

Func<Capture, Capture, Boolean> test = (inner, outer) =>
    (inner.Index >= outer.Index) &&
    (inner.Index < outer.Index + outer.Length);

var result = xs.Select(x => new
                            {
                                Fir = firs.FirstOrDefault(f => test(f, x)),
                                Sec = secs.FirstOrDefault(s => test(s, x))
                            })
               .ToList();

Here is one possible solution using the following extension method.

internal static class Extensions
{
    internal static IEnumerable<Capture> GetCapturesInside(this Match match,
         Capture capture, String groupName)
    {
        var start = capture.Index;
        var end = capture.Index + capture.Length;

        return match.Groups[groupName]
                    .Captures
                    .Cast<Capture>()
                    .Where(inner => (inner.Index >= start) &&
                                    (inner.Index < end));
    }
}

Now you can rewrite the code as follows.

var input = "abc d x 1 2 x 3 x 5 6 e fgh";

var pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";

var match = Regex.Match(input, pattern);

foreach (Capture x in match.Groups["x"].Captures)
{
    var fir = match.GetCapturesInside(x, "fir").SingleOrDefault();
    var sec = match.GetCapturesInside(x, "sec").SingleOrDefault();
}
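
For completeness, here is a minimal self-contained sketch of the same Index/Length idea that prints the grouped pairs. The helper name ValueInside and the "-" placeholder for a missing subgroup are my own assumptions for illustration, not part of .NET; it is a local function rather than an extension method only so that everything fits in one file of top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

string input = "abc d x 1 2 x 3 x 5 6 e fgh";
string pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
Match match = Regex.Match(input, pattern);

// Hypothetical helper (not a .NET API): value of the capture of `groupName`
// that starts inside `outer`, or "-" when that subgroup did not participate.
string ValueInside(Capture outer, string groupName) =>
    match.Groups[groupName].Captures.Cast<Capture>()
         .Where(c => c.Index >= outer.Index && c.Index < outer.Index + outer.Length)
         .Select(c => c.Value)
         .DefaultIfEmpty("-")
         .First();

// One entry per iteration of the repeating "x" group, in input order.
var pairs = new List<string>();
foreach (Capture x in match.Groups["x"].Captures)
{
    pairs.Add(ValueInside(x, "fir") + ", " + ValueInside(x, "sec"));
}

foreach (string pair in pairs)
{
    Console.WriteLine(pair);
}
// Prints:
// 1, 2
// 3, -
// 5, 6
```

The second line shows that it is the middle "x" that lacks a second number, which is exactly the information the flat "fir" and "sec" capture lists lose.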
Daniel Brückner
  • 59,031
  • 16
  • 99
  • 143
  • Brilliant. Served the purpose elegantly and efficiently. Proof that if "Match.Group.Capture.Group" isn't in .NET, it ought to be. Thanks! – bob Dec 17 '12 at 19:35
  • @user1910619 I respectfully disagree...see my answer to the problem. – ΩmegaMan Dec 17 '12 at 19:47
3

Will it always be a pair versus single? You could use separate capture groups. Of course, you lose the order of items with this method.

var input = "abc d x 1 2 x 3 x 5 6 e fgh";
var re = new Regex(@"d\s(?<x>x\s((?<pair>\d+\s\d+)|(?<single>\d+))\s)*e");

var m = re.Match(input);
foreach (Capture s in m.Groups["pair"].Captures) 
{
    Console.WriteLine(s.Value);
}
foreach (Capture s in m.Groups["single"].Captures)
{
    Console.WriteLine(s.Value);
}

1 2
5 6
3

If you need the order, I'd probably go with Blam's suggestion to use a second regular expression.
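
A rough sketch of what that second-regex approach could look like, reusing the pattern above for the outer match. The inner pattern \d+ and the variable names are assumptions that only fit this contrived input; a real case would need an inner pattern that mirrors its own subgroups.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

string input = "abc d x 1 2 x 3 x 5 6 e fgh";

// Outer pattern: finds the region of interest and captures each "x ..." block.
var outer = new Regex(@"d\s(?<x>x\s((?<pair>\d+\s\d+)|(?<single>\d+))\s)*e");
// Inner pattern (assumption for this example): every run of digits in one block.
var inner = new Regex(@"\d+");

// Captures of "x" come back in input order, so the order of the blocks is kept.
var rows = new List<string>();
foreach (Capture x in outer.Match(input).Groups["x"].Captures)
{
    var numbers = inner.Matches(x.Value).Cast<Match>().Select(n => n.Value);
    rows.Add(string.Join(", ", numbers));
}

foreach (string row in rows)
{
    Console.WriteLine(row);
}
// Prints:
// 1, 2
// 3
// 5, 6
```

The cost is that each block is scanned twice, once by the outer match and once by the inner one.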

Adam Prescott
  • 943
  • 1
  • 8
  • 20
2

I suggest you look into balancing groups, a feature unique to .NET regex.

Here is a regex that uses one to stop the match when a non-digit or an x is found, which closes the group. The numbers are then accessed via the captures as required:

string data = "abc d x 1 2 x 3 x 5 6 e fgh";

string pattern =
@"(?xn)    # Specify options in the pattern
           # x - to comment (IgnorePatternWhitespace)
           # n - Explicit Capture to ignore non named matches

(?<X>x)                    # Push the X on the balanced group
  ((\s)(?<Numbers>\d+))+   # Load up on any numbers into the capture group
(?(Paren)(?!))             # Stop any match that has an X
                           #(the end of the balance group)";


var results = Regex.Matches(data, pattern)
                   .OfType<Match>()
                   .Select ((mt, index) => string.Format("Match {0}: {1}",
                                             index,
                                             string.Join(", ",
                                                         mt.Groups["Numbers"]
                                                         .Captures
                                                         .OfType<Capture>()
                                                         .Select (cp => cp.Value))))
                   ;

results.ToList()
       .ForEach( result => Console.WriteLine ( result ));
/* Output

Match 0: 1, 2
Match 1: 3
Match 2: 5, 6

*/ 
ΩmegaMan
  • 29,542
  • 12
  • 100
  • 122
  • This is another clever solution to the question I asked, thank you. Unfortunately, for my real-world case the named subgroups are not all the same, they have their own independent patterns associated with their names. Ultimately the solution has to be in the code rather than the regex pattern. – bob Dec 17 '12 at 19:48
  • @bob I only worked with the example you gave, if the pattern has differing sub groups, then the system of matched balanace groups can be applied to the subgroups as well or a in pattern if clause could handle the data capture groups independently depending on the needs. – ΩmegaMan Dec 17 '12 at 19:54
  • Hm, I guess I'm not understanding something here; in all cases doesn't the solution have to come from .NET code and not from more advanced regex expressions? I can't see how we could change the regex to pull the same results ("1,2", "3", "4,5") out of something more complex, like `ab x id:7 val:8 c d x other:9 id:1 other:10 val:2 otherjunk x id:3 x val:6 id:5 e fgh`. And even if we could, the regex complexity would presumably far outweigh the need, especially when the answer is already loaded into the Match object from the simple original regex match, and just needed to be accessed somehow. – bob Dec 17 '12 at 20:17
1

I have seen OmegaMan's answer and know that you prefer a C# code solution over a regex one, but I wanted to present one alternative anyway.

In .NET you can reuse named groups. Every time something is captured with that group, it's pushed onto the stack (that's what OmegaMan was referring to by "balancing groups"). You can use this to push an empty capture onto the stack for every x you find:

string pattern = @"d (?<x>x(?<d>) (?:(?<d>\d+) )*)*e";

So now after matching x the (?<d>) pushes an empty capture onto the stack. Here is the Console.WriteLine output (one line per capture):

 
1
2

3

5
6

Hence, when you then walk through Regex.Match(input, pattern).Groups["d"].Captures and take note of empty strings, you know that a new group of numbers has started.
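
That walk could be sketched roughly as follows. The groups list and the way it is printed are my own choices for illustration; the only thing taken from the pattern is that an empty capture marks the start of a new x.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string input = "abc d x 1 2 x 3 x 5 6 e fgh";
string pattern = @"d (?<x>x(?<d>) (?:(?<d>\d+) )*)*e";

// Each inner list holds the numbers that followed one "x".
var groups = new List<List<string>>();
foreach (Capture c in Regex.Match(input, pattern).Groups["d"].Captures)
{
    if (c.Length == 0)
    {
        // Empty capture pushed by (?<d>) right after an "x": start a new group.
        groups.Add(new List<string>());
    }
    else
    {
        // A number: it belongs to the most recently started group.
        groups[groups.Count - 1].Add(c.Value);
    }
}

foreach (List<string> numbers in groups)
{
    Console.WriteLine(string.Join(", ", numbers));
}
// Prints:
// 1, 2
// 3
// 5, 6
```

Note that this relies on the pattern always pushing the empty marker before any number, so the first capture is never a number without a group to receive it.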

Martin Ender
  • 43,427
  • 11
  • 90
  • 130
  • Ah, I didn't realize you could reuse the named groups with different patterns, I can see how this might work. Helpful information! I wasn't limiting myself to a C# solution at all, just recognizing that if .NET didn't provide a way to get the groups within a capture, some special C# code would inevitably be necessary (as in this case, watching for blank capture values). I do still prefer the solution @Daniel offered. Besides being nicely generalizable, I find it keeps the regex pattern complexity more proportionate to the complexity of the input. Thanks though! – bob Dec 18 '12 at 15:27