-7

I am trying to parse a multi-line HTML file using a regex.

HTML code:

<td>Details</td></tr>  
<tr class=d1>
<td>uss_vod_translator</td>

Regex Expression:

if ($line =~ m/Details<\/td>\s*<\/tr>\s*<tr\s*class=d1>\s*<td>(\w*)<\/td>/)
{
    print "$1";
}

I am using \s* (whitespace) to match across the line breaks, but it is not working. I searched for a solution and even tried /\? for multi-line matching, but that did not work either.

Can anyone please suggest how to parse multi-line HTML?

I know regex is a poor solution for parsing HTML, but I have legacy HTML code which I need to parse, and I have no other choice.

dreamer
  • 11
    [Regex is a poor solution for parsing HTML](http://stackoverflow.com/a/1732454/1583), in general. – Oded Nov 06 '12 at 10:50
  • 1
    Judging by your variable name, you only have one line, so how can you match something that spans more than one line? – ikegami Nov 06 '12 at 10:52
  • 3
    The best possible answer was written some time ago by someone else: http://stackoverflow.com/a/1732454/1065241 – zzzz Nov 06 '12 at 10:58

3 Answers

12

Can anyone please suggest how to parse multi-line HTML?

Stop trying to use regular expressions and use a module that will parse it for you.

HTML::TreeBuilder is a good solution.

HTML::TreeBuilder::LibXML gives you the same API but backed by a fast parser.

HTML::TreeBuilder::XPath adds XPath support as well as a fast parser.

Quentin
0

As stated above, never use regexes to parse HTML.

I use HTML::TreeBuilder::XPath to parse HTML, and it dramatically decreases the development time for each of my scraping/parsing programs.

Here is how your task could be implemented:

use Modern::Perl;
use HTML::TreeBuilder::XPath;

my $html = <<END;
<tr><td>General Info</td></tr>  
<tr class=d1>
<td>some info</td></tr>
<tr><td>Details</td></tr>  
<tr class=d1>
<td>uss_vod_translator</td></tr>
<tr><td>Another header</td></tr>  
<tr class=d1>
<td>some other info</td></tr>
END

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my ($details) = $tree->findvalues('//tr[ td[ text() = "Details" ] ]/following-sibling::tr[1]/td[1]');
say $details;
gangabass
-3

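Try the line below before matching your pattern:

 $line =~ s/>\s+</></g;

This removes the whitespace (including newlines) between tags, so the HTML ends up as a single-line string. It only helps if $line holds the whole document rather than one line read from the file.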
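As a minimal sketch of that idea using only core Perl: the original pattern already tolerates newlines through \s*, so, as the comment under the question suggests, the likely problem is that the file is read one line at a time. The sample input, the extract_details helper, and the $path variable in the comment are all hypothetical names made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice this would be the whole file.
my $html = <<'END';
<tr><td>Details</td></tr>  
<tr class=d1>
<td>uss_vod_translator</td></tr>
END

# To read a real file into one string, undefine the input record separator
# ($path is a placeholder for your file name):
#   my $html = do { local $/; open my $fh, '<', $path or die $!; <$fh> };

# Collapse whitespace between tags, then match the cell that follows "Details".
# Returns the captured cell text, or undef if the pattern does not match.
sub extract_details {
    my ($text) = @_;
    $text =~ s/>\s+</></g;
    return $text =~ m{Details</td></tr><tr\s+class=d1><td>(\w*)</td>} ? $1 : undef;
}

my $name = extract_details($html);
print "$name\n" if defined $name;   # prints "uss_vod_translator"
```

This still depends on the exact shape of the markup, so it is only reasonable for a fixed legacy format; a real parser remains the more robust choice.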

Madhankumar