0

I am trying to grab some data from a horse racing data site: http://www.hkjc.com/english/racing/horse.asp?horseno=P278

Basically, I wget the above URL, loop through each line of the HTML source, and check whether the line matches a regular expression. If it does, I grab some data and store it in a variable.

Example of the source (it should be on lines 384-386):

<td class=htable_eng_text align=center>
2      
</td>   

I was trying to match the digit "2", but I failed.

I did something like this:

if ($line =~ /\s+([1-12])\s+/) {
# if it matches, store $1 in a variable
}

Here are my questions: I'm not sure whether I should add \s+ at the end, and I'm not sure whether I can use ([1-12]), since the number I want to grab is in the range 1-12. (I also tried (\d||11||12), but that failed too.)

I'm new to Perl and hope someone can help. Thank you!

anubhava
mark

3 Answers

2

To match a number in the range 1-12, you can use this regex:

/\b([1-9]|1[0-2])\b/
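
For context, `[1-12]` is a character class, not a numeric range: it matches a single character that is either in the range `1`-`1` or is `2`, so it only ever matches "1" or "2". The sketch below tries the alternation against lines modelled on the fragment in the question. The anchored form (`^\s*` ... `\s*$`) is an assumption on my part, chosen so that only a line holding nothing but the number matches; the sample lines `12` and `13` are made up to show the range check.

```perl
use strict;
use warnings;

# Anchored variant of the 1-12 pattern (an assumption, not the exact regex
# above): the whole line must be the number plus optional whitespace.
my $draw_re = qr/^\s*([1-9]|1[0-2])\s*$/;

# Sample lines modelled on the HTML fragment from the question;
# '12' and '13' are made-up extras to exercise the range.
my @lines = (
    '<td class=htable_eng_text align=center>',
    '2      ',
    '</td>   ',
    '12',
    '13',
);

my @draws;
for my $line (@lines) {
    # $1 holds the captured number when the line matches
    push @draws, $1 if $line =~ $draw_re;
}

print join(',', @draws), "\n";
```

Because the whitespace is optional (`\s*`), this also copes with lines that have no leading or trailing spaces, which `\s+` on both sides would reject.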
anubhava
1

Don't use a regex to parse HTML. See: RegEx match open tags except XHTML self-contained tags

You should REALLY consider using an HTML parser instead. Because you're working with a table here, you can use HTML::TableExtract. It turns your table into a nice friendly Perl data structure, and your code will be less brittle.

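use strict;
use warnings;
use HTML::TableExtract;

# Read the page source, e.g. piped in from: wget -O - URL
my $html_string = do { local $/; <STDIN> };

# The headers here are placeholders; use the column headings of your table
my $te = HTML::TableExtract->new( headers => [qw(Date Price Cost)] );
$te->parse($html_string);

# Examine all matching tables
foreach my $ts ($te->tables) {
  print "Table (", join(',', $ts->coords), "):\n";
  foreach my $row ($ts->rows) {
    # Empty cells come back as undef, so print them as empty strings
    print join(',', map { $_ // '' } @$row), "\n";
  }
}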
Sobrique
1

Here's an example of using HTML::TableExtract as @Sobrique recommended. Without knowing what information you want, I guessed at the Race Index and Draw columns.

You can change the columns retrieved using the headers parameter of the HTML::TableExtract->new call, for instance [ qw/ Index Dr Jockey Class Course / ], but you would have to change the print loop to display them.

use strict;
use warnings;

use LWP::Simple 'get';
use HTML::TableExtract;

my $url = 'http://www.hkjc.com/english/racing/horse.asp?horseno=P278';
my $html = get($url) or die "Could not fetch $url\n";

my $extract = HTML::TableExtract->new(
    headers => [ qw/ Index Dr /],
);
$extract->parse($html);

my $table = $extract->first_table_found;

print "Race\n";
print "Index Draw\n";

for my $row ($table->rows) {
  my ($index, $dr) = @$row;

  if ( $dr ) {
    $dr =~ s/\s+//g;
    printf "%-5s %-s\n", $index, $dr;
  }
  else {
    print $index, "\n";
  }
}

output

Race
Index Draw
 14/15 Season
390   4
334   9
273   3
207   2
156   7
129   11
098   10
009   10
 13/14 Season
682   12
637   8
596   8
570   6
452   5
372   4
332   3
272   6
218   5
144   9
098   12
033   10
 12/13 Season
728   1
598   9
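
As a rough sketch of the headers change mentioned above, here is the same approach with a third column added and the print loop adjusted to match. To keep it runnable offline it parses a small inline table instead of fetching the page; the markup, the heading text ("Race Index", "Dr.", "Jockey") and the jockey names are made-up stand-ins, not copied from the real site.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up stand-in for the real page: one small table with three headings.
my $html = <<'HTML';
<table>
<tr><td>Race Index</td><td>Dr.</td><td>Jockey</td></tr>
<tr><td>390</td><td>4</td><td>Jockey A</td></tr>
<tr><td>334</td><td>9</td><td>Jockey B</td></tr>
</table>
HTML

# Header strings are matched against the heading cells, so 'Index'
# picks up 'Race Index' and 'Dr' picks up 'Dr.'.
my $extract = HTML::TableExtract->new(
    headers => [ qw/ Index Dr Jockey / ],
);
$extract->parse($html);

my $table = $extract->first_table_found
    or die "No table with those headers\n";

my @rows;
for my $row ($table->rows) {
    # One variable per requested header, in the order given above
    my ($index, $dr, $jockey) = @$row;
    push @rows, join('|', $index, $dr, $jockey);
    printf "%-5s %-4s %s\n", $index, $dr, $jockey;
}
```

Each extra header adds one element to every row, so the destructuring line and the `printf` format are the two places that need to change together.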
Borodin