I am trying to do a screen scrape in Perl and have it down to an array of table elements.

The string:

<tr>
        <td>10:11:00</td>
        <td><a href="/page/controller/33">712</a></td>
        <td>Start</td>
        <td>Finish</td>
        <td>200</td>
        <td>44</td>

Code:

if ($item =~ /<td>(.*)?<\/td>/)
{
    print "\t$item\n";
    print "\t1: $1\n";
    print "\t2: $2\n";
    print "\t3: $3\n";
    print "\t4: $4\n";
    print "\t5: $5\n";
    print "\t6: $6\n";
}

output:

1: 10:11:00
2: 
3: 
4: 
5: 
6: 

I tried multiple things but could not get the intended results. Thoughts?

Cœur
jeremyforan

2 Answers

use strict;
use warnings;

my $item = <<EOF;
<tr>
        <td>10:11:00</td>
        <td><a href="/page/controller/33">712</a></td>
        <td>Start</td>
        <td>Finish</td>
        <td>200</td>
        <td>44</td>
EOF

if(my @v = ($item =~ /<td>(.*)<\/td>/g))
{
  print "\t$item\n";
  print "\t1: $v[0]\n";
  print "\t2: $v[1]\n";
  print "\t3: $v[2]\n";
  print "\t4: $v[3]\n";
  print "\t5: $v[4]\n";
  print "\t6: $v[5]\n";
}

or

if(my @v = ($item =~ /<td>(.*)<\/td>/g))
{
  print "\t$item\n";
  print "\t$_: $v[$_-1]\n" for 1..@v;
}

Output:

1: 10:11:00
2: <a href="/page/controller/33">712</a>
3: Start
4: Finish
5: 200
6: 44
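
One caveat worth spelling out: the greedy `(.*)` above works because every cell sits on its own line and `.` does not match a newline. The following is only a sketch, with a made-up single-line `$row` (not part of the original input), showing how the greedy and non-greedy forms differ once several cells share a line:

```perl
use strict;
use warnings;

# Hypothetical input: the same kind of cells, but all on a single line.
my $row = '<td>Start</td><td>Finish</td><td>200</td>';

# Greedy: .* runs on to the last </td> of the line, so there is a
# single oversized match instead of one match per cell.
my @greedy = $row =~ /<td>(.*)<\/td>/g;

# Non-greedy: .*? stops at the first </td>, giving one match per cell.
my @lazy = $row =~ /<td>(.*?)<\/td>/g;

print scalar(@greedy), " greedy match(es): @greedy\n";
print scalar(@lazy), " non-greedy match(es): @lazy\n";
```

If the markup might be laid out either way, `(.*?)` is probably the safer default.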
ikegami
perreal

The code behaves exactly as you told it to. This is what happens:

You matched the regex exactly once. It did match, and populated the $1 variable with the value of the first (and only!) capture group. Because the pattern contains no other groups, $2 through $6 are never set. The match returns "true", and the code in the if-branch is executed.
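
To make that concrete, here is a minimal sketch using a shortened, made-up two-cell version of the input. The variable names `$first` and `$second_defined` are only for illustration:

```perl
use strict;
use warnings;

# Shortened stand-in for the string in the question.
my $item = "<td>10:11:00</td>\n<td>Start</td>\n";

my ($first, $second_defined);

# One pair of parentheses means one capture group, so a single match
# can only ever fill $1.
if ($item =~ /<td>(.*)<\/td>/) {
    $first          = $1;
    $second_defined = defined $2 ? 1 : 0;
    print "1: $first\n";
    print "2 is ", ($second_defined ? "set" : "undefined"), "\n";
}
```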

You want to do two things:

  1. Match with the /g modifier. This matches globally and returns every match in the string, not just the first one.
  2. Execute the regex in list context, so you can save the captured values to an array.

This would lead to the following code:

if ( my @matches = ($item =~ /REGEX/g) ) {
  for my $i (1 .. @matches) {
    print "$i: $matches[$i-1]\n";
  }
}
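
If you would rather not collect everything into an array first, a different technique is to use /g in scalar context inside a while loop, where each iteration picks up after the previous match. This is only a sketch, assuming a trimmed copy of the question's string and the non-greedy pattern `<td>(.*?)<\/td>`:

```perl
use strict;
use warnings;

# Trimmed copy of the string from the question.
my $item = <<'EOF';
<tr>
        <td>10:11:00</td>
        <td>Start</td>
        <td>Finish</td>
EOF

my @cells;

# In scalar context, each /g match resumes where the previous one
# ended, and the loop stops once no further match is found.
while ($item =~ /<td>(.*?)<\/td>/g) {
    push @cells, $1;
    print scalar(@cells), ": $1\n";
}
```

The loop form is handy when you want to act on each cell as it is found. The list-context form is shorter when you just need all the values.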

Do also note that parsing HTML with regexes is evil, and you should search CPAN for a module you like that does that for you.
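
As one possible illustration, here is a sketch using HTML::TreeBuilder, a CPAN module that is not part of core Perl. Choosing that module is my assumption, and others such as HTML::TableExtract or Mojo::DOM would do as well. The sample markup is a shortened copy of the question's row, wrapped in a complete table:

```perl
use strict;
use warnings;

my $html = <<'EOF';
<table>
<tr>
        <td>10:11:00</td>
        <td><a href="/page/controller/33">712</a></td>
        <td>Start</td>
</tr>
</table>
EOF

my @cells;

# HTML::TreeBuilder is not a core module; load it at run time so the
# script still runs (and says so) when it is not installed.
if (eval { require HTML::TreeBuilder; 1 }) {
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down returns every element matching the criteria;
    # as_text gives the text content with nested tags stripped.
    @cells = map { $_->as_text } $tree->look_down(_tag => 'td');
    $tree->delete;    # free the parse tree

    print "$_: $cells[$_-1]\n" for 1 .. @cells;
}
else {
    warn "HTML::TreeBuilder is not installed\n";
}
```

Unlike the regex, this returns the link text of the second cell rather than the raw `<a>` markup, which is probably what you want from a scrape.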

Community
amon