Regex: Repeated capturing groups

Question

I have to parse some tables from an ASCII text file. Here's a partial sample:

QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212
RECKITTBEN  192.50  209.00  192.50  201.80    5.21      34      2850     5.707
RUPALIINS   150.00  159.00  150.00  156.25    6.29       4        80      .125
SALAMCRST   164.00  164.75  163.00  163.25    -.45      80      8250    13.505
SINGERBD    779.75  779.75  770.00  773.00    -.89       8        95      .735
SONARBAINS   68.00   69.00   67.50   68.00     .74      11      3050     2.077

The table consists of 1 column of text and 8 columns of floating point numbers. I'd like to capture each column via regex.

I'm pretty new to regular expressions. Here's the faulty regex pattern I came up with:

(\S+)\s+(\s+[\d\.\-]+){8}

But the pattern captures only the first and the last columns. RegexBuddy also emits the following warning:

You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations.

I've consulted their help file, but I don't have a clue as to how to solve this.

How can I capture each column separately?

@Tim: Yes I intend to write the program in C#. But at the moment, I'm prototyping it in Python. — invarbrass, Jul 03 '10 at 20:01
See also: http://stackoverflow.com/questions/3029127/is-there-a-regex-flavor-that-allows-me-to-count-the-number-of-repetitions-matched/ — polygenelubricants, Jul 04 '10 at 07:59
It can be retrieved with group captures. Take a look at http://stackoverflow.com/questions/11051558/regular-expression-to-select-repeating-groups — Marko Kukovec, May 08 '13 at 09:00

score 17 · Accepted Answer · answered Jul 03 '10 at 19:58

In C# (modified from this example):

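// Requires: using System; using System.Text.RegularExpressions;
string input = "QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212";
string pattern = @"^(\S+)\s+(\s+[\d.-]+){8}$";
// The enum member is RegexOptions.Multiline (lowercase "l").
Match match = Regex.Match(input, pattern, RegexOptions.Multiline);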
if (match.Success) {
   Console.WriteLine("Matched text: {0}", match.Value);
   for (int ctr = 1; ctr < match.Groups.Count; ctr++) {
      Console.WriteLine("   Group {0}:  {1}", ctr, match.Groups[ctr].Value);
      int captureCtr = 0;
      foreach (Capture capture in match.Groups[ctr].Captures) {
         Console.WriteLine("      Capture {0}: {1}", 
                           captureCtr, capture.Value);
         captureCtr++; 
      }
   }
}

Output:

Matched text: QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212
...
    Group 2:      1.212
         Capture 0:  11.00
         Capture 1:    11.10
         Capture 2:    11.00
...etc.

Thanks for the heads up. I'm looking into the Group.Captures property. — invarbrass, Jul 03 '10 at 20:06
`Captures` is a neat feature, but it seems like overkill here. Why not just split each line on whitespace? Even if you use the regex to validate the format of the line, it's still less work. — Alan Moore, Jul 04 '10 at 09:27
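
The split-on-whitespace idea from the comment above could look roughly like this in Python. This is only a sketch: `parse_line` and `LINE_RE` are hypothetical names, and the validation pattern assumes the same "one text column, eight numeric columns" layout as the sample data.

```python
import re

# Validation only: one text column, then exactly 8 numeric-looking columns.
# LINE_RE is a hypothetical name, not something from the thread.
LINE_RE = re.compile(r"\S+(?:\s+[-.\d]+){8}")

def parse_line(line):
    """Hypothetical helper: return the 9 columns of a row, or None if malformed."""
    if not LINE_RE.fullmatch(line.strip()):
        return None
    # str.split() with no arguments splits on any run of whitespace.
    return line.split()

row = parse_line("SALAMCRST   164.00  164.75  163.00  163.25    -.45      80      8250    13.505")
bad = parse_line("this line is not a table row")
```

Here the regex only checks the shape of the line; the actual column extraction is done by `split()`, so no repeated capturing group is needed.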

score 5 · Answer 2 · edited May 23 '17 at 12:09

If you want to know why the warning appears: your capturing group matches multiple times (8, as you specified), but a capture group can hold only one value. It keeps the last value matched.
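
A quick way to see this behavior is to run the question's pattern against one sample line in Python (the language the asker says they are prototyping in). This is just an illustrative sketch; the variable names are made up for the demo.

```python
import re

line = "QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212"

# The question's pattern: group 2 is repeated 8 times.
m = re.match(r"(\S+)\s+(\s+[\d\.\-]+){8}", line)

first_column = m.group(1)       # the text column
last_capture = m.group(2)       # only the 8th iteration survives, leading whitespace included
group_count = len(m.groups())   # the pattern has two groups, so only two values come back
```

The seven earlier iterations of group 2 were matched but are no longer accessible through the match object.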

As described in question 1313332, retrieving these multiple matches is generally not possible with a regular expression, although .NET and Perl 6 have some support for it.

The warning suggests that you could put another group around the whole set, like this:

(\S+)\s+((\s+[\d\.\-]+){8})

You would then be able to see all the columns, but of course they would not be separated. Because it's generally not possible to capture them separately, the more common intention is to capture all of it, and the warning helps remind you of this.
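
As a rough illustration in Python, the outer group gives you all the numeric columns as one string, which you can then separate yourself. The sample line is taken from the question; the variable names are only for this sketch.

```python
import re

line = "RECKITTBEN  192.50  209.00  192.50  201.80    5.21      34      2850     5.707"

# Group 2 wraps the whole repetition; group 3 is the repeated inner group.
m = re.match(r"(\S+)\s+((\s+[\d\.\-]+){8})", line)

name = m.group(1)
all_numbers = m.group(2)        # every numeric column, still joined in one string
columns = all_numbers.split()   # separate them on runs of whitespace
```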

score 4 · Answer 3 · kennytm · edited Jul 03 '10 at 19:48

Unfortunately, you need to repeat the (…) group 8 times to get each column separately.

^(\S+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)$
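
If typing the group eight times feels error-prone, one option is to build the same pattern by string repetition. This is a sketch, assuming Python; `NUM` and `pattern` are names introduced here for illustration only.

```python
import re

NUM = r"([-.\d]+)"  # one numeric column, same character class as the pattern above

# Text column followed by 8 separately captured numeric columns.
pattern = re.compile(r"^(\S+)" + (r"\s+" + NUM) * 8 + r"$")

line = "SINGERBD    779.75  779.75  770.00  773.00    -.89       8        95      .735"
m = pattern.match(line)
fields = list(m.groups())   # 9 entries: the name plus the 8 numbers
```

The compiled regex is the same as the hand-written one; only the way the pattern string is produced differs.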

If you can post-process the match in code, you can first match the numeric columns as a whole (theAsciiText is assumed to hold the file contents as one string)

>>> import re
>>> rx1 = re.compile(r'^(\S+)\s+((?:[-.\d]+\s+){7}[-.\d]+)$', re.M)
>>> allres = rx1.findall(theAsciiText)

then split the columns on whitespace

>>> [[p] + q.split() for p, q in allres]

edited Jul 03 '10 at 19:48

answered Jul 03 '10 at 19:38

Kenny, thanks for the prompt response! I'm actually using that pattern right now. But I was wondering if there's a better solution using repeating capturing groups. – invarbrass Jul 03 '10 at 19:41
@invarbrass: Not with repeated capturing groups that I'm aware of. Regexes often work best if you don't try to overdo them with a one-shot. – Owen S. Jul 03 '10 at 19:54
KennyTM: Thanks! Your solution works - I was doing something similar, albeit a lot less elegantly. – invarbrass Jul 03 '10 at 19:57
.NET is special in that it keeps intermediate captures! See Tim's answer and http://stackoverflow.com/questions/3029127/is-there-a-regex-flavor-that-allows-me-to-count-the-number-of-repetitions-matched/ – polygenelubricants Jul 04 '10 at 07:59
