I have an HTML file, "statistics.htm", and I have read its contents into a variable.

Suppose the variable contains the following data:

<ul class="chart-legend">
    <li class="label-1">
      <div><em></em>FTP<br>
      0 B</div>
    </li>
    <li class="label-2">
      <div><em></em>HTTP<br>
      589 KB</div>
    </li>
    <li class="label-3">
      <div><em></em>POP3/IMAP<br>
      0 B</div>
    </li>
    <li class="label-4">
      <div><em></em>SMTP<br>
      0 B</div>
    </li>
</ul>

If the customer passes FTP as an argument, I want the FTP value, which is 0 B in the example above.

How can I achieve this?

HoldOffHunger
Cindrella
  • Can it be achieved using regular expressions? – Cindrella Sep 26 '12 at 13:07
  • [Do not use regular expressions to parse HTML.](http://stackoverflow.com/a/1732454/1331451) – simbabque Sep 26 '12 at 13:10
  • Agreed. Don't do this unless you are feeling fast & flakey™: `$ftp = $1 if $var =~ m|FTP<br>\s*(.*?)<|` – bobbogo Sep 26 '12 at 13:18
  • Actually, a solution like bobbogo's can be OK if you are working with a limited, controlled set of HTML pages that you know will always be in exactly the same format. But in general, regexes on HTML are a bad idea. – dan1111 Sep 26 '12 at 13:24

2 Answers

There are several Perl modules that parse HTML for you. I suggest you try one of those, and then ask a specific question if you have any problems.

Lots of information about this is available on SO and elsewhere on the web. One example question that points to some of the available modules is "How to parse between <div class ="foo"> and </div> easily in Perl".
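As a rough illustration of the parser-module approach, here is a minimal sketch using HTML::TreeBuilder from CPAN. It assumes every `div` in the legend has the shape shown in the question (a label, a `<br>`, then the value); `legend_values` is a hypothetical helper name, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper: map each legend label to the text after its <br>.
# Assumes every <div> looks like <div><em></em>LABEL<br>VALUE</div>.
sub legend_values {
    my ($html_text) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html_text);
    my %value_for;
    for my $div ( $tree->look_down( _tag => 'div' ) ) {
        # In a content list, plain strings are text nodes; elements are refs.
        my @text;
        for my $child ( grep { !ref } $div->content_list ) {
            ( my $trimmed = $child ) =~ s/^\s+|\s+$//g;
            push @text, $trimmed if length $trimmed;
        }
        $value_for{ $text[0] } = $text[1] if @text >= 2;
    }
    $tree->delete;    # release the parse tree
    return \%value_for;
}

my $html_text = <<'HTML';
<ul class="chart-legend">
    <li class="label-1">
      <div><em></em>FTP<br>
      0 B</div>
    </li>
    <li class="label-2">
      <div><em></em>HTTP<br>
      589 KB</div>
    </li>
</ul>
HTML

# The label comes from the command line; FTP is only a fallback for the demo.
my $wanted = @ARGV ? $ARGV[0] : 'FTP';
my $values = legend_values($html_text);
my $shown  = defined $values->{$wanted} ? $values->{$wanted} : "no entry for $wanted";
print "$shown\n";
```

Building the whole label-to-value hash once is convenient if several protocols are looked up from the same page; for a single lookup, an XPath query may be shorter.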

Community
dan1111

You can do this very simply with HTML::TreeBuilder::XPath (OK, very simply until you get to the fun XPath query!):

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

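# Parse the HTML that follows __DATA__ into a tree.
my $html = HTML::TreeBuilder::XPath->new->parse_file( \*DATA );
# findvalue returns the string value of the XPath expression.
my $ftp = $html->findvalue( 'normalize-space( //div/br[./preceding-sibling::text()="FTP"]/following-sibling::text())' );
print "$ftp\n";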


__DATA__
<ul class="chart-legend">
    <li class="label-1">
      <div><em></em>FTP<br>
      0 Ba</div>
    </li>
    <li class="label-2">
      <div><em></em>HTTP<br>
      589 KB</div>
    </li>
    <li class="label-3">
      <div><em></em>POP3/IMAP<br>
      0 Bb</div>
    </li>
    <li class="label-4">
      <div><em></em>SMTP<br>
      0 Bc</div>
    </li>
</ul>

The XPath expression: look for a br in a div whose preceding sibling text is 'FTP' (you may want to normalize spaces there), then take the following sibling text. Wrap this in normalize-space to clean up the result. Voilà!
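Since the question asks for the label to come from an argument, the same query can be parameterized. This is only a sketch: `value_for_label` is a hypothetical helper name, and it assumes the label contains no double quote, because the label is placed into the XPath string literal unescaped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Hypothetical helper: return the text that follows the <br> after $label.
# Assumes $label contains no double quote (it is interpolated as-is).
sub value_for_label {
    my ( $html_text, $label ) = @_;
    my $tree  = HTML::TreeBuilder::XPath->new_from_content($html_text);
    my $xpath = sprintf
        'normalize-space( //div/br[./preceding-sibling::text()="%s"]/following-sibling::text())',
        $label;
    my $value = $tree->findvalue($xpath);
    $tree->delete;    # release the parse tree
    return $value;
}

my $html_text = <<'HTML';
<ul class="chart-legend">
    <li class="label-1">
      <div><em></em>FTP<br>
      0 B</div>
    </li>
    <li class="label-2">
      <div><em></em>HTTP<br>
      589 KB</div>
    </li>
</ul>
HTML

# The label comes from the command line; FTP is only a fallback for the demo.
my $label = @ARGV ? $ARGV[0] : 'FTP';
my $value = value_for_label( $html_text, $label );
print "$label: $value\n";
```

If labels can come from untrusted input, validating them against a known list (FTP, HTTP, and so on) before building the query avoids malformed XPath.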

mirod