Regex in Perl messed up by bracket

Question

I am new to Perl and recently ran into the following problem.

I have a string with the format " $num1 $num2 $num3 $num4", where $num1, $num2, $num3, $num4 are real numbers, each written either in scientific notation or in regular decimal format.

Now I want to extract the 4 numbers from the string using a regular expression.

$real_num = '\s*([+-]?[0-9]+\.?[0-9]*([eE][+-]?[0-9]+)?)';
while (<FP>) {
    if (/$real_num$real_num$real_num$real_num/) {
        print $1; print $2; print$3; print$4;
    }
}

How can I get $num1, $num2, $num3, $num4 from $1, $2, $3, $4? The $real_num regular expression itself needs a pair of parentheses (for the optional exponent), so $1, $2, $3, $4 are not what I expect.

Thanks for all the warm replies; a non-capturing group is the answer I need!

http://stackoverflow.com/questions/638565/parsing-scientific-notation-sensibly — , Jun 20 '13 at 06:16
oh how are the 4 numbers separated? You could split and iterate through them — , Jun 20 '13 at 06:18
do you mean parentheses `()` when you say brackets (which is `[]`)? Anyway it is not "necessary", you can make the parentheses non-capturing as detailed in [Rohit's answer](http://stackoverflow.com/a/17206530/1743811). — doubleDown, Jun 20 '13 at 07:39

Rohit Jain · Accepted Answer · 2013-06-20T06:27:19.480

5

Just use non-capturing groups in your $real_num regex and make the regex itself a captured group:

$real_num = '\s*([+-]?[0-9]+\.?[0-9]*(?:[eE][+-]?[0-9]+)?)';

Now, the problem is that /$real_num$real_num$real_num$real_num/ will easily go wrong if there are more than 4 numbers on a line: the pattern is unanchored, so it matches some four of them and silently ignores the rest. Maybe that is not the case now, but you should take care of it as well. A split would be a better option.
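To make the group numbering concrete, here is a minimal sketch that runs the original pattern and the non-capturing variant over the same line. The helper name captures, the sample line, and the added ^ ... \s*$ anchors are assumptions for illustration, not part of the question.

```perl
use strict;
use warnings;

# Original pattern: the exponent group also captures, so each number
# occupies two capture slots.
my $nested = '\s*([+-]?[0-9]+\.?[0-9]*([eE][+-]?[0-9]+)?)';
# Fixed pattern: (?: ... ) groups without capturing.
my $flat   = '\s*([+-]?[0-9]+\.?[0-9]*(?:[eE][+-]?[0-9]+)?)';

# Hypothetical helper: returns the captures from four consecutive
# copies of $pattern, anchored to the whole line.
sub captures {
    my ($pattern, $line) = @_;
    my @got = $line =~ /^$pattern$pattern$pattern$pattern\s*$/;
    return map { defined $_ ? $_ : 'undef' } @got;
}

my $line = " 1.5 -2e3 3 4.25E-2";
print join(" | ", captures($nested, $line)), "\n";
print join(" | ", captures($flat,   $line)), "\n";
```

With the nested pattern the match yields eight captures, and the numbers sit in $1, $3, $5 and $7; with the non-capturing pattern there are exactly four, in $1 to $4.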

edited Jun 20 '13 at 06:27

answered Jun 20 '13 at 06:18

Rohit Jain


click on the "tick" icon, then! :-D – Massa Jun 20 '13 at 12:19
This fails on input as simple as .3, in addition to the problem you mentioned. For example, "9 2. .3 6 7 8" will return 3 6 7 8. – Joseph Myers Jun 20 '13 at 15:25
Thanks Myers, I'll pay attention to that :) – SpectreV Jun 21 '13 at 07:47

Miguel Prz · Answer 2 · 2013-06-20T06:29:27.960

3

If you are sure that your lines contain only numbers, you can avoid that regexp by using the split function:

while (<FP>) {
    my @numbers = split /\s+/; #<-- an array with the parsed numbers
}

If you need to check whether the extracted strings are really numbers, use looks_like_number from Scalar::Util. Example:

use strict;
use warnings;
use feature 'say';   # say is not available without this (or use v5.10)
use Scalar::Util qw/looks_like_number/;

while(<DATA>) {
    my @numbers = split /\s+/;
    @numbers = map { looks_like_number($_) ? $_ : undef } @numbers;
    say "@numbers";
}


__DATA__
1 2 NaN 4 -1.23
5 6 f 8 1.32e12

Prints:

1 2 NaN 4 -1.23
5 6  8 1.32e12

edited Jun 20 '13 at 06:29

answered Jun 20 '13 at 06:22

Miguel Prz


Why does no one realize that this code will produce scads of warnings about using uninitialized values in join or string if there is even one non-numerical piece of data present? Your answer is not bad in principle, but you should know at least to use grep not map for a job like this. – Joseph Myers Jun 20 '13 at 07:17
I'm running perl 5.18 and the warnings you mention don't appear. Anyway, this code tries to show an idea; the concrete details of a better implementation aren't the point in this case. – Miguel Prz Jun 20 '13 at 07:32
Actually, even one warning appears even when you run the program on your own DATA. In the second line the `f` causes a warning to appear. Just use grep rather than map and your solution will work fine, i.e., `grep { looks_like_number($_) } @numbers`, but it will still be slower because of the use of the slower looks_like_number library subroutine. – Joseph Myers Jun 20 '13 at 07:39
As I said, there are no warnings in my environment, perl 5.18.0; what is yours? I don't agree with the use of grep: you probably need to mark non-numbers with undef and keep a fixed number of items in the array. – Miguel Prz Jun 20 '13 at 11:05
Using grep makes sense because then the first number is in $numbers[0], second one in $numbers[1], etc. Using map is impractical because the OP can't refer to the numbers with no way of knowing whether they are stored in $numbers[0] or $numbers[5] or any other random part of the array. – Joseph Myers Jun 20 '13 at 15:48
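The map versus grep disagreement in these comments can be illustrated with a small sketch. The variable names are made up for the example; the input is the second DATA line from the answer.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

my @tokens = split ' ', "5 6 f 8 1.32e12";

# map keeps one slot per token: non-numbers become undef "holes",
# so a number's index depends on what else was on the line.
my @with_holes = map { looks_like_number($_) ? $_ : undef } @tokens;

# grep drops the non-numeric tokens: the numbers are contiguous and
# the first number is always at index 0.
my @only_numbers = grep { looks_like_number($_) } @tokens;

printf "map:  %d elements, element 2 is %s\n",
    scalar @with_holes, defined $with_holes[2] ? $with_holes[2] : 'undef';
printf "grep: %d elements, element 2 is %s\n",
    scalar @only_numbers, $only_numbers[2];
```

Which behaviour is right depends on the goal: map preserves column positions, grep preserves "n-th number on the line". Interpolating the map result into a string is what triggers the "uninitialized value" warning when warnings are enabled.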

Joseph Myers · Answer 3 · 2013-06-20T07:35:03.273

The answers to two important questions will affect whether you even need to use a regular expression to match the various number formats, or if you can do something much simpler:

1. Are you certain that your lines contain numbers only, or do they also contain other data (or possibly some lines have no numbers at all and only other data)?
2. Are you certain that all numbers are separated from each other and/or other data by at least one space? If not, how are they separated? (For example, output from portsnap fetch generates lots of numbers like this: 3690....3700.... with decimal points and no spaces at all used to separate them.)

If your lines contain only numbers and no other data, and numbers are separated by spaces, then you do not even need to check if the results are numbers, but only split the line apart:

my @numbers = split /\s+/;
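One detail worth checking against the format in the question, which starts with a space: split /\s+/ keeps an empty leading field when the line begins with whitespace, whereas the special pattern ' ' skips leading whitespace. A minimal sketch, with a sample line assumed for illustration:

```perl
use strict;
use warnings;

my $line = " 1.5 -2e3 3 4.25E-2\n";

my @with_regex = split /\s+/, $line;   # leading empty field is kept
my @awk_style  = split ' ',   $line;   # leading whitespace is skipped

printf "%d fields vs %d fields\n", scalar @with_regex, scalar @awk_style;
```

With split /\s+/ the first number ends up in element 1 rather than element 0 for such lines, so split ' ' is probably the safer choice when lines may be indented.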

If you are not sure that your lines contain only numbers, but you are sure that there is at least one space between each number and any other number or other data, then the next line of code is a fairly good way of extracting numbers: it lets Perl itself decide what is a number instead of spelling out every format in a regex. (This assumes that you do not want to convert other data values to NaN.) One caveat: the test 1*$_ eq $_ keeps only strings that are already in the form Perl prints a number in, such as 200 or -1.23, so inputs like 1.0, +5 or 1.32e12 are dropped even though they are valid numbers. The result in @numbers will be the numbers recognized within the current line of input.

my @numbers = grep { 1*$_ eq $_ } m/(\S*\d\S*)/g;
# we could do simply a split, but this is more efficient because when
# non-numeric data is present, it will only perform the number
# validation on data pieces that actually do contain at least one digit
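A quick way to see what the 1*$_ eq $_ comparison keeps is to run it over a few sample tokens. This is only a sketch: the helper name round_trips and the token list are assumptions, and the numeric warning is silenced locally because non-numeric tokens are expected.

```perl
use strict;
use warnings;

# Keep a token only if multiplying by 1 and converting back to a
# string reproduces exactly the same text.
sub round_trips {
    my ($token) = @_;
    no warnings 'numeric';   # tokens such as "f8" are expected here
    return 1 * $token eq $token;
}

for my $token (qw(200 -1.23 1.0 +5 1.32e12 f8)) {
    printf "%-8s %s\n", $token, round_trips($token) ? "kept" : "dropped";
}
```

Only 200 and -1.23 are kept here. If scientific notation or a leading + must be accepted, as in the question, a regex test or looks_like_number is likely the better filter.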

You can determine if at least one number was present by checking the truth value of the expression @numbers > 0, and if exactly four were present by using the condition @numbers == 4, etc.

If your numbers are bumped up against each other, for instance, 5.17e+7-4.0e-1 then you will have a more difficult time. That is the only time you will need complicated regular expressions.
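As a rough sketch of that harder case, a global match with an explicit number pattern can pull apart numbers that touch each other. The pattern and the helper name all_numbers are assumptions for illustration; the pattern does not accept a trailing-dot form such as 2.

```perl
use strict;
use warnings;

# Optional sign, then digits with an optional fraction (or a bare
# fraction), then an optional exponent with an optional sign.
my $number = qr/[-+]?(?:\d+\.\d+|\.\d+|\d+)(?:[Ee][-+]?\d+)?/;

# Hypothetical helper: every number found in $text, left to right.
# Intended to be called in list context.
sub all_numbers {
    my ($text) = @_;
    return $text =~ /($number)/g;
}

print join(", ", all_numbers("5.17e+7-4.0e-1")), "\n";
```

For this input the match returns 5.17e+7 and -4.0e-1. The sign is always attached to the number that follows it, so an expression like 3-2 would be read as 3 and -2, which may or may not be what is wanted.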

Note: Updated code to be even faster/better.

Note 2: There is a problem with the most up-voted answer due to a subtlety of how map works when storing the value of undef. This can be illustrated by the output from that program when using it to extract numbers from the first line of data such as an HTTP log file. The output looks correct, but the array actually has many empty elements and one would not find the first number stored in $numbers[0] as expected. In fact, this is the full output:

$ head -1 http | perl prog1.pl
Use of uninitialized value $numbers[0] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[1] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[2] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[3] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[4] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[5] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[6] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[7] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[10] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[11] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[12] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[13] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[14] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[15] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[16] in join or string at prog1.pl line 8, <> line 1.
        200 2206

(Note that the indentation of these numbers shows how many empty array elements are present in @numbers and have been joined together by spaces before the actual numbers when the array has been converted to a string.)

However, my solution produces the proper results both visually and in the actual array contents, i.e., $numbers[0], $numbers[1], etc., are actually the first and second numbers contained in the line of the data file.

while (<>) {
    my @numbers = m/(\S*\d\S*)/g;
    @numbers = grep { $_ eq 1*$_ } @numbers;
    print "@numbers\n";
}

$ head -1 http | perl prog2.pl

200 2206

Also, using the slow library function makes the other solution run 50% slower. Output was otherwise identical when running the programs on 10,000 lines of data.

Joseph Myers · Answer 4 · 2013-06-21T05:23:48.687

My previous answer did not address the issue of non-space separated numbers. This requires a separate answer in my opinion, since the output can be drastically different from the same data.

# the sign of the exponent is optional, so 1.32e12 is accepted as well
my $number = '([-+]?(?:\d+\.\d+|\.\d+|\d+)(?:[Ee][-+]?\d+)?)';

my $type = shift;

if ($type eq 'all') {
    while (<>) {
        my @all_numbers = m/$number/g;
        # finds legal numbers whether space separated or not
        # this can be great, but it also means the string
        # 120.120.120.120 (an IP address) will return
        # 120.120, .120, and .120
        print "@all_numbers\n";
    }
} else {
    while (<>) {
        my @ss_numbers = grep { m/^$number$/ } split /\s+/;
        # finds only space separated numbers
        print "@ss_numbers\n";
    }
}

Usage:

$ prog-jkm2.pl all < input # prints all numbers
$ prog-jkm2.pl < input # prints just space-separated numbers

The only code that the OP probably needs:

my $number = '([-+]?(?:\d+\.\d+|\.\d+|\d+)(?:[Ee][-+]?\d+)?)';
my @numbers = grep { m/^$number$/ } split /\s+/;

At this point, $numbers[0] will be the first number, $numbers[1] is the second number, etc.
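Put together for the four-number format in the question, this might look like the sketch below. The helper name four_numbers and the sample lines are hypothetical, and treating the exponent sign as optional is an assumption made so that values such as -2e3 are accepted.

```perl
use strict;
use warnings;

# Same shape as the pattern above, with the exponent sign optional.
my $number = qr/[-+]?(?:\d+\.\d+|\.\d+|\d+)(?:[Ee][-+]?\d+)?/;

# Hypothetical helper: returns the numbers on a line if exactly four
# whitespace-separated tokens are valid numbers, otherwise an empty list.
sub four_numbers {
    my ($line) = @_;
    my @numbers = grep { /^$number$/ } split ' ', $line;
    return @numbers == 4 ? @numbers : ();
}

for my $line (" 1.5 -2e3 .3 4.25E-2\n", "1 2 3\n") {
    my @n = four_numbers($line);
    print @n ? "num1=$n[0] num2=$n[1] num3=$n[2] num4=$n[3]\n"
             : "skipped: not exactly four numbers\n";
}
```

Returning an empty list for anything other than four numbers keeps the caller's check simple; whether lines with extra non-numeric tokens should also be rejected is a separate decision this sketch does not make.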

Examples of output:

  $ head -1 http | perl prog-jkm2.pl
200 2206
  $ head -1 http | perl prog-jkm2.pl all
67.195 .114 .38 19 2011 01 20 31 -0400 1 1 1.0 200 2206 5.0
