Perl: extracting data from text using regex

Question

I am using Perl to do text processing with regex. I have no control over the input. I have shown some examples of the input below.

As you can see, items B and C can appear in the string any number of times with different values. I need to get all the values as back references. Or if you know of a different way, I am all ears.

I am trying to use the branch reset pattern (as outlined at perldoc: "Extended Patterns"), but I am not having much luck matching the string.

("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

My Perl is below, any help would be great. Thanks for any help you can give.

if($inputString =~/\("Data" \(Int "A" ([0-9]+)\)(?:\(Int "B" ([0-9]+)\)\(Int "C" ([0-9]+)\))+\(Int "D" ([0-9]+)\)\(Int "E" ([0-9]+)\)\)/) {

    print "\n\nmatched\n";

    print "1: $1\n";
    print "2: $2\n";
    print "3: $3\n";
    print "4: $4\n";
    print "5: $5\n";
    print "6: $6\n";
    print "7: $7\n";
    print "8: $8\n";
    print "9: $9\n";

}

It would greatly help if you could describe what you are trying to achieve. Not how (get all the values as back references), but what (i.e. "I need to get the values to be able to ..."). — , May 17 '09 at 18:31

score 10 · Accepted Answer · answered May 17 '09 at 17:28

Don't try to use one regex; a set of regexes and splits is easier to understand:

#!/usr/bin/perl

use strict;
use warnings;

while (<DATA>) {
    next unless my ($data) = /\("Data" (.*)\)/;
    print "on line $., I saw:\n";
    for my $item ($data =~ /\((.*?)\)/g) {
        my ($type, $var, $num) = split " ", $item;
        print "\ttype $type var $var num $num\n";
    }
}

__DATA__
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

If your data can stretch across lines, I would suggest using a parser instead of a regex.
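If the values need to be looked up by name afterwards, the same two steps can fill a hash of arrays instead of printing. This is only a minimal sketch: it assumes every item has the form (Type "Name" Number), and the sub name parse_items is hypothetical, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping each name to an array ref
# of its values, in order of appearance. Returns nothing if the line does
# not look like a ("Data" ...) record.
sub parse_items {
    my ($line) = @_;
    my %values;
    # Step 1: isolate the body of the record.
    return unless my ($data) = $line =~ /\("Data" (.*)\)/;
    # Step 2: take each parenthesised item and split it on whitespace.
    for my $item ($data =~ /\((.*?)\)/g) {
        my ($type, $var, $num) = split " ", $item;
        $var =~ tr/"//d;                  # strip the quotes around the name
        push @{ $values{$var} }, $num;    # repeated names accumulate
    }
    return \%values;
}

my $line = '("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "D" 34896)(Int "E" 38046))';
my $v = parse_items($line);
print "B: @{ $v->{B} }\n";   # B: 1 3
print "C: @{ $v->{C} }\n";   # C: 2 4
```

Keeping an array per name means repeated B and C values are not overwritten, which a plain name-to-value hash would do.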

Beano · Answer 2 · 2009-05-26T08:51:12.677

3

I am not sure what benefit there would be in getting the values as back references. How would you wish to deal with the case of duplicated keys (like "C" in the second line)? Also, I am not sure what you wish to do with the values once they are extracted.

But I would start with something like:

use Data::Dumper;

while (<DATA>)
{
    my @a = m!\(Int "(.*?)" ([0-9]+)\)!g;
    print Dumper(\@a);
}

__DATA__
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

This gives you a flat array of alternating keys and values, with repeated keys kept in order.
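Because the list alternates key, value, key, value, it can be walked two elements at a time. The following is a sketch under the same input-format assumption; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = '("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "D" 34896)(Int "E" 38046))';

# Flat list of captures: ("A", 22, "B", 1, "C", 2, ...)
my @flat = $line =~ m!\(Int "(.*?)" ([0-9]+)\)!g;

# Regroup into [key, value] pairs, preserving order and duplicates.
# splice returns an empty list once @flat is exhausted, which ends the loop.
my @pairs;
while (my ($key, $value) = splice @flat, 0, 2) {
    push @pairs, [ $key, $value ];
}

print join(", ", map { "$_->[0]=$_->[1]" } @pairs), "\n";
# A=22, B=1, C=2, B=3, C=4, D=34896, E=38046
```

An array of pairs is used here rather than a hash because a hash keyed on the name would keep only one value for each of "B" and "C".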

edited May 26 '09 at 08:51

answered May 17 '09 at 16:53

Beano


\d is not the same as [0-9] in Perl 5.8 and 5.10; it matches any Unicode character that has the digit attribute (including "\x{1815}", MONGOLIAN DIGIT FIVE). If you mean [0-9], you must either use [0-9] or use the bytes pragma (but that turns all strings into 1-byte characters and is normally not what you want). – Chas. Owens May 17 '09 at 17:39
Explanation of the m!! regex: I tend to use the 'm!!' form of pattern match rather than the usual '//' because I have to escape the '/' character more often than the '!' character. You can use almost any character to delimit your pattern match (this also applies to sed). The regex itself matches the characters '(Int "', then captures the smallest possible run of any characters followed by '" ', then captures some digits followed by ')'. Use the 'g' modifier to match repeatedly and you have a solution. If this does not explain what you intended, please ask again. – Beano May 17 '09 at 22:14
With regard to \d matching any Unicode character with the digit attribute, whereas [0-9] matches the specific span of ASCII characters: I guess you need to bear in mind what your input data is going to consist of. In the above example I assumed an ASCII data range (a reasonable assumption, I thought), as this illustrated the use of the regular expression. I would have thought that if my input data was Mongolian, then I would probably have been interested in "digit five", and therefore the regex would still have been valid. – Beano May 17 '09 at 22:23
@Beano Assumptions such as "my data will always be ASCII" are the source of lots of bugs. Use [0-9] if that is what you are looking for; only use \d if you mean to match any digit character. "\x{1815}" is just a way-out-there example of the sort of character you don't want to match; there are others that are more likely to show up, like "\x{FF15}" (FULLWIDTH DIGIT FIVE), which looks like a normal "\x{0035}" but which you can't do math with. – Chas. Owens May 17 '09 at 23:09
I would argue that without the full context of the application, input data, etc. it is hard to say one way or the other what is the "correct" behavior. – Beano May 18 '09 at 06:48
And re-reading your comment, I did not state that I would ALWAYS assume my data is ASCII, I said for "the above example" - without full context, who can say what is correct or not. There must be cases where '\d' is a valid use, otherwise the digit attribute is a bit of a waste of time. – Beano May 18 '09 at 06:54
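The difference discussed in these comments can be checked directly. The sketch below tests the FULLWIDTH DIGIT FIVE character mentioned above against both forms; the variable names are illustrative.

```perl
use strict;
use warnings;

my $fullwidth_five = "\x{FF15}";   # FULLWIDTH DIGIT FIVE, not an ASCII digit

# \d uses the Unicode digit property for strings like this one,
# while [0-9] only ever accepts the ten ASCII digits.
my $matches_d     = ($fullwidth_five =~ /^\d$/)    ? 1 : 0;
my $matches_range = ($fullwidth_five =~ /^[0-9]$/) ? 1 : 0;

print "\\d matches U+FF15: $matches_d\n";        # \d matches U+FF15: 1
print "[0-9] matches U+FF15: $matches_range\n";  # [0-9] matches U+FF15: 0
```

Later Perl versions (5.14 and up) also offer the /a modifier, which restricts \d to ASCII digits; that is outside the 5.8 and 5.10 releases discussed here.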

score 1 · Answer 3 · answered May 17 '09 at 17:10

My initial thought was to use named captures and to get the values from %-:

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\)
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )+
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)
/x;

Unfortunately, the (?:...) grouping doesn't trigger capturing multiple values for B and C. I suspect that this is a bug. Doing it explicitly does capture all the values but you would have to know the maximum number of instances ahead of time.

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\)
    \(Int\s+"B"\s+(?<B>[0-9]+)\)
    \(Int\s+"C"\s+(?<C>[0-9]+)\)
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )?
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )?
    # repeat (?:...) N times
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)
/x;
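With the groups spelled out like this, every group that shares a name contributes one slot to the array in %-, and groups that did not take part in the match are undef. A reduced sketch covering only "B", assuming Perl 5.10 or later for named captures:

```perl
use strict;
use warnings;
use 5.010;   # named captures and %- need Perl 5.10+

# Reduced version of the explicit pattern: up to three "B" items.
my $pattern = qr/
  \(Int\s+"B"\s+(?<B>[0-9]+)\)
  (?: \(Int\s+"B"\s+(?<B>[0-9]+)\) )?
  (?: \(Int\s+"B"\s+(?<B>[0-9]+)\) )?
/x;

my $input = '(Int "B" 1)(Int "B" 3)';
my @b;
if ($input =~ $pattern) {
    # $-{B} holds one entry per group named B; drop the unused ones.
    @b = grep { defined } @{ $-{B} };
}
print "@b\n";   # 1 3
```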

The simplest approach is to use m//g. You can either capture name/value pairs as Beano suggests or use multiple patterns to capture each value:

my @b = m/Int "B" ([0-9]+)/g;
my @c = m/Int "C" ([0-9]+)/g;
# etc.
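If the overall structure still has to be validated, one option is to combine both ideas: match the whole record once, capturing the repeated B/C run as a single string, then use m//g on that string. This is a sketch under the question's input format, not a tested solution for other inputs; all variable names are illustrative.

```perl
use strict;
use warnings;

my $input = '("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "D" 34896)(Int "E" 38046))';

# One B/C pair, without captures, so it can be repeated safely.
my $pair = qr/\(Int "B" [0-9]+\)\(Int "C" [0-9]+\)/;

my (@b, @c);
if (my ($val_a, $middle, $val_d, $val_e) = $input =~
        /\("Data" \(Int "A" ([0-9]+)\)((?:$pair)+)\(Int "D" ([0-9]+)\)\(Int "E" ([0-9]+)\)\)/) {
    # $middle is the whole run of B/C pairs; pull each pair out in turn.
    while ($middle =~ /\(Int "B" ([0-9]+)\)\(Int "C" ([0-9]+)\)/g) {
        push @b, $1;
        push @c, $2;
    }
    print "A=$val_a B=@b C=@c D=$val_d E=$val_e\n";
    # A=22 B=1 3 C=2 4 D=34896 E=38046
}
```

The outer match rejects records that do not follow the A, (B C)+, D, E layout, while the inner loop avoids having to know the number of pairs in advance.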

Captures inside quantified groups only return the last capture. This isn't really a bug or a feature, just the way they work. As far as I know, C# has the only implementation that captures multiple times out of a quantified match. — Chas. Owens, May 17 '09 at 17:38
