Regex to parse a multiline HTML

Question

I am trying to parse a multi-line HTML file using a regex.

HTML code:

<td>Details</td></tr>  
<tr class=d1>
<td>uss_vod_translator</td>

Regex Expression:

if ($line =~ m/Details<\/td>\s*<\/tr>\s*<tr\s*class=d1>\s*<td>(\w*)<\/td>/)
{
    print "$1";
}

I am using \s* (whitespace) to match across lines, but it is not working. I searched for a solution and even tried /\? for multi-line, but that did not work either.

Can anyone please suggest how to parse multi-line HTML?

I know regex is a poor solution for parsing HTML, but I have legacy HTML code that I need to parse and have no other choice.

[Regex is a poor solution for parsing HTML](http://stackoverflow.com/a/1732454/1583), in general. — Oded, Nov 06 '12 at 10:50
Judging by your variable name, you only have one line, so how can you match something that spans more than one line? — ikegami, Nov 06 '12 at 10:52
The best possible answer has been written some time ago by someone else: http://stackoverflow.com/a/1732454/1065241 — zzzz, Nov 06 '12 at 10:58

score 12 · Answer 1 · answered Nov 06 '12 at 10:53

Can anyone please suggest how to parse multi-line HTML?

Stop trying to use regular expressions and use a module that will parse it for you.

HTML::TreeBuilder is a good solution.

HTML::TreeBuilder::LibXML gives you the same API but backed by a fast parser.

HTML::TreeBuilder::XPath adds XPath support as well as a fast parser.

score 0 · Answer 2 · answered Dec 06 '12 at 10:48

As stated above, never use regexes to parse HTML.

I'm using HTML::TreeBuilder::XPath to parse HTML, and it has dramatically decreased the development time for each of my scraping/parsing programs.

Here is how your task could be implemented:

use Modern::Perl;
use HTML::TreeBuilder::XPath;

my $html = <<END;
<tr><td>General Info</td></tr>  
<tr class=d1>
<td>some info</td></tr>
<tr><td>Details</td></tr>  
<tr class=d1>
<td>uss_vod_translator</td></tr>
<tr><td>Another header</td></tr>  
<tr class=d1>
<td>some other info</td></tr>
END

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my ($details) = $tree->findvalues('//tr[ td[ text() = "Details" ] ]/following-sibling::tr[1]/td[1]');
say $details;

score -3 · Answer 3 · answered Nov 06 '12 at 11:25 · Madhankumar

Try the line below before you match your pattern:

 $line =~ s/>\s+</></g;

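This collapses the whitespace between tags, so the HTML string ends up on a single line. It only helps if $line holds the whole document rather than one line of it.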
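A minimal sketch of that idea, assuming the real cause is the one ikegami points out in the comments (the variable only ever holds one line). It reads the whole input in slurp mode, applies the substitution, and then runs the pattern from the question. The sample text and the names $sample, $html and $name are made up for illustration; an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the real file contents.
my $sample = <<'END';
<td>Details</td></tr>
<tr class=d1>
<td>uss_vod_translator</td>
END

# Open an in-memory filehandle on the sample; with a real file you would
# pass its path here instead of \$sample.
open my $fh, '<', \$sample or die "Cannot open input: $!";

# Slurp mode: with $/ undefined, <$fh> returns everything in one read.
my $html = do { local $/; <$fh> };
close $fh;

# Collapse whitespace between adjacent tags (optional here, since \s in
# the pattern below already matches newlines).
$html =~ s/>\s+</></g;

my $name;
if ($html =~ m{Details</td>\s*</tr>\s*<tr\s+class=d1>\s*<td>(\w*)</td>}) {
    $name = $1;
    print "$name\n";
}
```

Once the whole document is in one scalar, \s* spans the line breaks by itself; the /s modifier only changes what . matches, so it is not needed for this pattern. This is still a regex workaround for a known, fixed layout, not a replacement for the parser modules recommended in the other answers.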

edited Nov 06 '12 at 11:53

answered Nov 06 '12 at 11:25
