match a string and get the word next to it in perl

Question

I have an HTML file, "statistics.htm", and I have read its contents into a variable.

Suppose the variable contains the data below:

<ul class="chart-legend">
    <li class="label-1">
      <div><em></em>FTP<br>
      0 B</div>
    </li>
    <li class="label-2">
      <div><em></em>HTTP<br>
      589 KB</div>
    </li>
    <li class="label-3">
      <div><em></em>POP3/IMAP<br>
      0 B</div>
    </li>
    <li class="label-4">
      <div><em></em>SMTP<br>
      0 B</div>
    </li>
</ul>

If the customer passes FTP as an argument, I want the FTP value, which is 0 B in the case above.

How can I achieve this?

[Do not use regular expressions to parse HTML.](http://stackoverflow.com/a/1732454/1331451) — simbabque, Sep 26 '12 at 13:10
Agreed. Don't do this unless you are feeling fast & flakey™: `$ftp = $1 if $var =~ m|FTP<br>\s*(.*?)<|` — bobbogo, Sep 26 '12 at 13:18
Actually, a solution like bobbogo's can be ok if you are working with a limited, controlled set of HTML pages that you know will always be in the exact same format. But in general, regexes on HTML are a bad idea. — dan1111, Sep 26 '12 at 13:24

score 1 · Answer 1 · edited May 23 '17 at 11:43

There are several Perl modules that parse HTML for you. I suggest you try one of those, and then ask a specific question if you have any problems.

Lots of information about this is available on SO and the web. Here is one example question that will point you to some of the modules available: "How to parse between <div class ="foo"> and </div> easily in Perl".

answered Sep 26 '12 at 13:10

dan1111


score 1 · Accepted Answer · answered Sep 26 '12 at 13:36

You can do this very simply with HTML::TreeBuilder::XPath (OK, very simply until you get to the fun XPath query!):

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

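# Parse the sample markup that follows __DATA__.
my $html= HTML::TreeBuilder::XPath->new->parse_file( \*DATA);
# The expression yields a string, so findvalue is the natural call here.
my $ftp= $html->findvalue( 'normalize-space( //div/br[./preceding-sibling::text()="FTP"]/following-sibling::text())');
print "$ftp\n";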


__DATA__
<ul class="chart-legend">
    <li class="label-1">
      <div><em></em>FTP<br>
      0 Ba</div>
    </li>
    <li class="label-2">
      <div><em></em>HTTP<br>
      589 KB</div>
    </li>
    <li class="label-3">
      <div><em></em>POP3/IMAP<br>
      0 Bb</div>
    </li>
    <li class="label-4">
      <div><em></em>SMTP<br>
      0 Bc</div>
    </li>
</ul>

The XPath expression: look for a br in a div whose preceding sibling text is 'FTP' (you may want to normalize spaces there). Then take the following sibling text. Wrap this in normalize-space to clean up the result. Voilà!
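
Since the question asks for the label to come in as an argument, the same XPath idea can be sketched with the label interpolated into the expression. This is only a minimal sketch: `value_for` is a hypothetical helper name, the markup is the sample from the question, and it assumes the label contains no double quote (it is pasted straight into the XPath string).

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# value_for() is a hypothetical helper, not part of the module.
# It returns the text that follows the <br> placed after the given label.
sub value_for {
    my ($tree, $label) = @_;

    # Assumption: $label contains no double quote, because it is
    # interpolated directly into the XPath string literal.
    my $xpath = sprintf
        'normalize-space(//div/br[preceding-sibling::text()="%s"]/following-sibling::text())',
        $label;

    return $tree->findvalue($xpath);
}

# Sample markup copied from the question.
my $html = <<'HTML';
<ul class="chart-legend">
    <li class="label-1">
      <div><em></em>FTP<br>
      0 B</div>
    </li>
    <li class="label-2">
      <div><em></em>HTTP<br>
      589 KB</div>
    </li>
    <li class="label-3">
      <div><em></em>POP3/IMAP<br>
      0 B</div>
    </li>
    <li class="label-4">
      <div><em></em>SMTP<br>
      0 B</div>
    </li>
</ul>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Label taken from the command line, defaulting to FTP.
my $label = @ARGV ? $ARGV[0] : 'FTP';
print value_for($tree, $label), "\n";
```

Saved under any file name and run with a label such as HTTP as its first argument, this should print the figure listed under that label. A label that does not appear in the markup gives an empty string, so the caller may want to check for that.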
