Repeatable, complex regular expression, with dot '.' delimited separators

Question

I have a regular expression. It contains a required named capture group, and some optional named capture groups. It captures individual matches and parses the sections into the named groups which I need.

Except, now I need it to repeat.

Essentially, my regular expression represents a single atomic unit in a (potentially) much longer string. Instead of matching my regex exactly, the target string will usually contain repeated instances of the regex, separated by the dot '.' character.

For example, if this is what my regular expression captures: <some match>

The actual string could look like any of these:

<some match>
<some match>.<some other match>
<some match>.<some other match>.<yet another match>

What is the simplest way in which to modify the original regular expression, to account for the repeating patterns, while ignoring the dots?

I'm not sure if it's actually needed, but here is the regular expression which I'm using to capture individual segments. Again, I'd like to enhance this to account for optional additional segments. I'd like to have each segment appear as another "match" in the result set:

^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])?(?:\[(?<index2>[0-9]+)\])?(?:\[(?<index3>[0-9]+)\])?$

It is intended to parse a class path, with up to three optional index accessors (e.g. "member.sub_member[0].sub_sub_member[0][1][2]").

I suspect the answer involves look-ahead or look-behind, with which I am not entirely familiar.

I currently use String.Split to separate string segments. But I figure if the enhancement to the regex is simple enough, I could skip the extra Split step and reuse the regex as a validation mechanism as well.

EDIT:

As an additional wrench in the gears, I'd like to disallow any dot '.' character from the beginning or end of the string. They should only exist as separators between path segments.

A simplistic approach would be to split the string on `.` and then run your regex on each one. — Tim S., Jul 19 '13 at 12:09
I currently do that. I figured if the enhancement to the regular expression is simple enough, I'd be able to forgo the string.Split, and additionally be able to validate the string before it's parsed. — BTownTKD, Jul 19 '13 at 12:11
In other words, you are looking for contiguous matches separated by a dot from the beginning to the end of the string, and nothing else, aren't you? — Casimir et Hippolyte, Jul 19 '13 at 12:18
In most languages (/"flavours of regex"), it's not possible to count how many capture groups are matched by * or +... however, luckily for you, in .NET it's easy. See: http://stackoverflow.com/questions/3029127/is-there-a-regex-flavor-that-allows-me-to-count-the-number-of-repetitions-matche — Tom Lord, Jul 19 '13 at 12:20
Shouldn't that be fairly simple? For simplicity's sake I'm calling your entire regexp `ident`, so replace that by your regex. Then it'll be something like this: `ident(\.ident)*`, no? — Alxandr, Jul 19 '13 at 12:21
Four excellent answers; I will need to test and digest each one to figure out which approach works best for me. If only I could give four "correct answer" checkmarks... — BTownTKD, Jul 19 '13 at 12:58
@Alxandr, please post that as the answer; it is a viable option. — BTownTKD, Jul 19 '13 at 13:18

p.s.w.g · Accepted Answer · 2013-07-19T13:32:41.157

3

You don't really need to use any look-arounds. You can put a (^|\.) in front of your main pattern and then a + after it. That will allow you to match a repeating, dot-separated sequence. I would also recommend you combine your <index> groups into a single capture group for simplicity (I used * to match any number of indexes, but you can just as easily use {0,3} to match at most 3). The final pattern would be:

(?:(?:^|\.)(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])*)+$

For example:

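var input = "member.sub_member[0].sub_sub_member[0][1][2]";
var pattern = @"(?:(?:^|\.)(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])*)+$";
var match = Regex.Match(input, pattern);

// Each named group now holds several captures, one per repetition.
// Flatten every capture of every group and order them by position.
var parts =
    (from Group g in match.Groups
     from Capture c in g.Captures
     orderby c.Index
     select c.Value)
    .Skip(1); // skip group 0, which is the entire match

foreach (var part in parts)
{
    Console.WriteLine(part);
}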

Which will output:

member
sub_member
0
sub_sub_member
0
1
2

Update: This pattern will ensure that the string cannot have any leading or trailing dots. It's a monster, but it should work:

^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3}(?:\.(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3})*$

Or this one, although I did have to give up on my 'no-look-arounds' idea:

^(?!\.)(?:(?:^|\.)(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3})+$
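Because a repeated group keeps all of its captures in .NET, the first update pattern above (the one without look-arounds) can both validate a path and recover which indexes belong to which member. The following is only a sketch under that assumption: the names PathParser and Parse are hypothetical, and the indexes are attributed to members by comparing capture positions, which is one possible approach rather than the only one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class PathParser
{
    // The first "Update" pattern from the answer, unchanged.
    private const string Pattern =
        @"^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3}" +
        @"(?:\.(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3})*$";

    // Returns one (member, indexes) pair per path segment,
    // or null when the path does not match the pattern.
    public static List<(string Member, string[] Indexes)> Parse(string path)
    {
        Match match = Regex.Match(path, Pattern);
        if (!match.Success)
        {
            return null;
        }

        // Both groups accumulate one capture per repetition, in match order.
        List<Capture> members = match.Groups["member"].Captures.Cast<Capture>().ToList();
        List<Capture> indexes = match.Groups["index"].Captures.Cast<Capture>().ToList();

        var result = new List<(string Member, string[] Indexes)>();
        for (int i = 0; i < members.Count; i++)
        {
            // An index belongs to a member when it sits between that member
            // and the next one (or the end of the input).
            int start = members[i].Index;
            int end = i + 1 < members.Count ? members[i + 1].Index : path.Length;

            string[] own = indexes
                .Where(c => c.Index > start && c.Index < end)
                .Select(c => c.Value)
                .ToArray();

            result.Add((members[i].Value, own));
        }
        return result;
    }
}
```

Calling PathParser.Parse on the sample path should give three segments, the last one carrying three indexes, while inputs such as ".member" or "member." should come back as null.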

edited Jul 19 '13 at 13:32

answered Jul 19 '13 at 12:20

I like the simplification. I may need to change the trailing * to a {0,3}, because there is a hard limit of 3 index accessors. But that's beside the point. Will the (^|\.) pattern you've prescribed ensure that there are no head-or-tail dots? I.e. dots should only exist between path segments - not at the beginning or end. – BTownTKD Jul 19 '13 at 12:58
After plugging the regex in, it only seems to generate a single match. It 'eats' all the preceding path segments, and treats the entire thing as a single 'member' group. – BTownTKD Jul 19 '13 at 13:10
@BTownTKD You're right, it does allow leading `.`'s (I'll work on fixing that), but it definitely shouldn't 'eat' the preceding segments. You probably just need to tweak how you're iterating through the results because each group can now have multiple captures. – p.s.w.g Jul 19 '13 at 13:15
I'm using Derek Slager's online .NET regex tester. It typically iterates through all matches and captures, and displays them in a nice format. http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx – BTownTKD Jul 19 '13 at 13:19
@BTownTKD I suspect that tool just doesn't handle multiple captures and shows only the last capture per group (most online regex testers work that way). I've tested this in LINQPad, and it does allow you to process each capture separately. See my updated answer for an alternative that will disallow leading dots. – p.s.w.g Jul 19 '13 at 13:32
Thanks sir. I have to give you the coveted check-mark, if for no other reason than "sheer number of regular expressions available." (But also, because it's correct.) – BTownTKD Jul 19 '13 at 13:38
@BTownTKD Thanks. In general I try to solve regex problems without using look-arounds first because (1) many languages don't support them, so if I ever have to port the code it makes it a lot easier, and (2) many programmers aren't familiar with how they work, so if I have to hand it off to someone else, it makes it easier for them to understand and maintain. But sometimes they are required anyway. Happy coding. – p.s.w.g Jul 19 '13 at 13:44

score 1 · Answer 2 · answered Jul 19 '13 at 12:19

The easiest way is likely to split the string using string.Split on the '.' character, and then apply your regular expression to each element in the resulting array. A Regex that long would have some brutal performance and potential lookahead/behind problems anyway.
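A minimal sketch of this split-then-match approach, reusing the single-segment pattern from the question as-is. The names SegmentSplitter and TryParse are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SegmentSplitter
{
    // The single-segment pattern from the question, unchanged.
    private static readonly Regex Segment = new Regex(
        @"^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])?" +
        @"(?:\[(?<index2>[0-9]+)\])?(?:\[(?<index3>[0-9]+)\])?$");

    // Returns one Match per segment, or null if any segment is malformed.
    public static Match[] TryParse(string path)
    {
        string[] pieces = path.Split('.');
        var matches = new Match[pieces.Length];

        for (int i = 0; i < pieces.Length; i++)
        {
            matches[i] = Segment.Match(pieces[i]);
            if (!matches[i].Success)
            {
                // A leading, trailing, or doubled dot produces an empty
                // piece, which the segment pattern rejects.
                return null;
            }
        }
        return matches;
    }
}
```

Since every piece has to match the anchored pattern on its own, this also works as a validation step, and the named groups index, index2 and index3 stay available on each returned Match.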

Alex Filipovici · Answer 3 · 2013-07-19T12:33:51.627

Try this beast out:

(?<=^|\.)?((?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])?(?:\[(?<index2>[0-9]+)\])?(?:\[(?<index3>[0-9]+)\])?)(?=\.){0,3}$?

Here's a sample console application:

using System;
using System.Text.RegularExpressions;

class Program
{
    public static void Main()
    {
        var input = @"member.sub_member[0].sub_sub_member[0][1][2]";
        var matches = Regex.Matches(input, @"(?<=^|\.)?((?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])?(?:\[(?<index2>[0-9]+)\])?(?:\[(?<index3>[0-9]+)\])?)(?=\.){0,3}$?");

        // Each path segment is reported as a separate match.
        foreach (Match match in matches)
        {
            Console.WriteLine("Member: {0} Index: {1} Index2: {2} Index3: {3}",
                match.Groups["member"].Value,
                match.Groups["index"].Value,
                match.Groups["index2"].Value,
                match.Groups["index3"].Value);
        }
    }
}

Casimir et Hippolyte · Answer 4 · 2013-07-19T12:31:16.380

1

You can use \G to be sure to have contiguous results and a lookahead to check if the pattern is followed by a dot or the end of the string:

var pattern = @"(?:^|\G\.)(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])?(?:\[(?<index2>[0-9]+)\])?(?:\[(?<index3>[0-9]+)\])?(?=\.|$)";

From MSDN: with \G, "The match must start at the position where the previous match ended".
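Since \G ties each match to the end of the previous one, this pattern is meant to be applied repeatedly with Regex.Matches; a single Regex.Match call returns only the first segment. A rough sketch of that usage follows. The class and method names are hypothetical, and the closing brackets are written as \] throughout, which should be equivalent to the unescaped form.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class ContiguousSegments
{
    private const string Pattern =
        @"(?:^|\G\.)(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])?" +
        @"(?:\[(?<index2>[0-9]+)\])?(?:\[(?<index3>[0-9]+)\])?(?=\.|$)";

    // Collects the member name of every contiguous segment.
    public static string[] Members(string input)
    {
        return Regex.Matches(input, Pattern)
            .Cast<Match>()
            .Select(m => m.Groups["member"].Value)
            .ToArray();
    }

    public static void Main()
    {
        var input = "member.sub_member[0].sub_sub_member[0][1][2]";

        // One line per segment found.
        foreach (string member in Members(input))
        {
            Console.WriteLine(member);
        }
    }
}
```

Matching stops at the first segment that does not follow directly from the previous one, so this finds the valid prefix of a path rather than validating the whole string.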

edited Jul 19 '13 at 12:31

answered Jul 19 '13 at 12:25

How could I change this to disallow dot '.' characters at the beginning or the end? I want to ensure they only exist between path segments. – BTownTKD Jul 19 '13 at 13:12
I tried plugging this into my test program (from my answer) and it shows `"member"` as the entire match – p.s.w.g Jul 19 '13 at 13:35
