Perl script to extract particular div section from HTML code

Question

I have a very large HTML file, and I need to extract one particular <div>...</div> section from it into a variable.

##some contents
<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td class="h1" width="400">Summary Of Test Report <br>(E:\Packages\SamplePackage)</td><td><a style="cursor:hand;text-decoration:none;" onclick="showTOC()"><div style="float:left"><div style="float:right"><div style="float:left"></div></div></a></td></tr></table></div><div expandable="1" id="c_1"><a name="title"></a><table class="content" cellpadding="2"><tr><td><table id="details"><tr><td class="h4">Package Name:</td><td class="info">E:\Packages\SamplePackage</td></tr><tr><td class="h4">OS:</td><td class="info">Microsoft Windows Server 2008 R2 Standard </td></tr><tr><td class="h4">Testing:</td><td class="info">Regression Test</td></tr><tr><td class="h4">Machine Name:</td><td class="info">XYZTST036   (Number Of Cores: 4              

; CPU Clock Speed: 3500           

  Mhz; Memory: 32,494 MB)</td></tr><tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr><tr><td class="h4">Total No. Of Testcases:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Executed:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Passed:</td><td class="info">42</td></tr><tr><td class="h4">No. Of Testcases Failed:</td><td class="info">0</td></tr><tr><td class="h4">No. Of Testcases NA(Not Appplicable):</td><td class="info">12</td></tr><tr><td class="h4">Skipped Testcases:</td><td class="info"><a href="SkippedTestcaseDetails.html">None</a></td></tr><tr><td class="h4">Date:</td><td class="info">8-02-2016
</td></tr><tr><td class="h4">Start Time(17:58:02)/ Completion Time (18:26:33)</td><td class="info"></td></tr></table></td></tr></table></div></div>
##some contents

I used a regex like this:

my $html_filepath = "G:\\Report.html";
open(HTML, "<$html_filepath") or die "Can't open $html_filepath $!\n";
$body .= "\nTest Report Summary:\n\n";
my $content;
my $summarySection;
{
    local $/ = undef; # slurp mode
    $content = <HTML>;
}
$content =~ s/\r\n//g;
#print $content;

if ($content ne "")
{
    if ($content =~ m/<div class="title-bar" (.*)/)
    #if ( $last_line =~ m/^<tr> <td>(\d+)<\/td>/ )
    {
        $summarySection = "$1";
    }
}
print "\n $summarySection";

The output I got is:

<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td class="h1" width="400">Summary Of Test Report <br>(E:\Packages\SamplePackage)</td><td><a style="cursor:hand;text-decoration:none;" onclick="showTOC()"><div style="float:left"><div style="float:right"><div style="float:left"></div></div></a></td></tr></table></div><div expandable="1" id="c_1"><a name="title"></a><table class="content" cellpadding="2"><tr><td><table id="details"><tr><td class="h4">Package Name:</td><td class="info">E:\Packages\SamplePackage</td></tr><tr><td class="h4">OS:</td><td class="info">Microsoft Windows Server 2008 R2 Standard </td></tr><tr><td class="h4">Testing:</td><td class="info">Regression Test</td></tr><tr><td class="h4">Machine Name:</td><td class="info">XYZTST036   (Number Of Cores: 4              

; CPU Clock Speed: 3500           

  Mhz; Memory: 32,494 MB)</td></tr><tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr><tr><td class="h4">Total No. Of Testcases:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Executed:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Passed:</td><td class="info">42</td></tr><tr><td class="h4">No. Of Testcases Failed:</td><td class="info">0</td></tr><tr><td class="h4">No. Of Testcases NA(Not Appplicable):</td><td class="info">12</td></tr><tr><td class="h4">Skipped Testcases:</td><td class="info"><a href="SkippedTestcaseDetails.html">None</a></td></tr><tr><td class="h4">Date:</td><td class="info">8-02-2016

But I need output like this:

<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td class="h1" width="400">Summary Of Test Report <br>(E:\Packages\SamplePackage)</td><td><a style="cursor:hand;text-decoration:none;" onclick="showTOC()"><div style="float:left"><div style="float:right"><div style="float:left"></div></div></a></td></tr></table></div><div expandable="1" id="c_1"><a name="title"></a><table class="content" cellpadding="2"><tr><td><table id="details"><tr><td class="h4">Package Name:</td><td class="info">E:\Packages\SamplePackage</td></tr><tr><td class="h4">OS:</td><td class="info">Microsoft Windows Server 2008 R2 Standard </td></tr><tr><td class="h4">Testing:</td><td class="info">Regression Test</td></tr><tr><td class="h4">Machine Name:</td><td class="info">XYZTST036   (Number Of Cores: 4              

; CPU Clock Speed: 3500           

  Mhz; Memory: 32,494 MB)</td></tr><tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr><tr><td class="h4">Total No. Of Testcases:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Executed:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Passed:</td><td class="info">42</td></tr><tr><td class="h4">No. Of Testcases Failed:</td><td class="info">0</td></tr><tr><td class="h4">No. Of Testcases NA(Not Appplicable):</td><td class="info">12</td></tr><tr><td class="h4">Skipped Testcases:</td><td class="info"><a href="SkippedTestcaseDetails.html">None</a></td></tr><tr><td class="h4">Date:</td><td class="info">8-02-2016
</td></tr><tr><td class="h4">Start Time(17:58:02)/ Completion Time (18:26:33)</td><td class="info"></td></tr></table></td></tr></table></div></div>

I also tried the following regex:

if ($content =~ m/<div class="title-bar" (.*)<\/table><\/div><\/div>/)

But this did not work.

Please give me some ideas for capturing the whole section, including its line breaks and whitespace.

Please note, I am not trying to parse the div section and extract its contents. I just need to pull out that section of the markup as-is, because I need it as a separate, self-contained HTML file. – Jeya Suriya Muthumari, Feb 23 '16 at 12:57
Using regular expressions to get data from HTML is often tricky, and prone to breaking if anything changes in the output (and then you end up with missing data, or two items bundled together, etc.). I strongly recommend you use an HTML parser. You can then either use the parsed HTML to access various bits of information (that's definitely the best option), or output the selected node as HTML. – jcaron, Feb 23 '16 at 13:19
Note that in your code, if you want to replace linefeeds/carriage returns, it's probably better to match `[\r\n]+`, which will match any combination such as `\r` (Mac style), `\n` (Unix style), or `\r\n` (DOS style). – jcaron, Feb 23 '16 at 13:20
Alternatively, you may want to add the `s` flag to your regular expressions so that `.` also matches newline. But you'll need to anchor the end of your regular expression. – jcaron, Feb 23 '16 at 13:21
@jcaron: The better way is to use `\R` which matches any standard line ending – Borodin, Feb 23 '16 at 15:04
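
To make the line-ending suggestions in the comments above concrete, here is a minimal sketch comparing the question's `s/\r\n//g` with the `[\r\n]+` and `\R` alternatives. The sample string and variable names are made up for illustration, and `\R` assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical sample mixing DOS (\r\n), Unix (\n) and old Mac (\r) endings.
my $text = "a\r\nb\nc\rd";

# The question's substitution: only removes the two-character DOS ending.
(my $crlf_only = $text) =~ s/\r\n//g;

# Character class from the comments: removes any run of \r and \n.
(my $char_class = $text) =~ s/[\r\n]+//g;

# \R (Perl 5.10+) matches any generic line ending, including \r\n as a unit.
(my $generic = $text) =~ s/\R//g;

# Print each result with the remaining control characters made visible.
for my $result ($crlf_only, $char_class, $generic) {
    (my $visible = $result) =~ s/\r/\\r/g;
    $visible =~ s/\n/\\n/g;
    print "$visible\n";
}
```

Only the first variant leaves stray `\n` and `\r` characters behind. On Windows, a file read in text mode typically arrives with `\r\n` already translated to `\n`, which would explain why `s/\r\n//g` appeared to change nothing.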

score 4 · Answer 1 · edited May 23 '17 at 12:33

Please don't use a regexp to parse HTML. Use a Perl module that parses HTML for you.

Something like HTML::TreeBuilder:

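use strict;
use warnings;
use HTML::TreeBuilder 5 -weak; # Ensure weak references

my $html_filepath = "G:\\Report.html"; # path taken from the question

my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($html_filepath)
    or die "Can't parse $html_filepath: $!\n";

# First <div> element whose class attribute is exactly "title-bar"
my $elem = $tree->look_down('_tag' => 'div', 'class' => 'title-bar')
    or die "No <div class=\"title-bar\"> found\n";
warn $elem->as_HTML; # serialize that element (and its children) back to HTML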

The problem with your regexp is that `.` does not match a newline by default, so `(.*)` stops at the first line break. The question "Regex to match any character including new lines" covers ways to match every character.

The way to fix this in Perl is the `/s` modifier (treat the string as a single line), which makes `.` match newlines as well:

if ($content =~ m/<div class="title-bar" (.*)<\/table><\/div><\/div>/s)
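
As a self-contained illustration of the `/s` fix, here is a minimal sketch run against a shortened, made-up stand-in for the report. Two details are assumptions added here and are not part of the answer's pattern: the capture group wraps the whole section so the opening and closing tags are kept, and `.*?` is non-greedy so the match stops at the first closing sequence and not the last one in the file.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for the slurped report file.
my $content = <<'HTML';
<p>before</p>
<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td>Summary</td></tr></table></div><div expandable="1" id="c_1"><table class="content"><tr><td><table id="details"><tr><td class="info">8-02-2016
</td></tr></table></td></tr></table></div></div>
<p>after</p>
HTML

my $summary_section;

# /s lets . cross the newline after the date.
# .*? stops at the first </table></div></div> that follows the opening tag.
# m{...} avoids having to escape the slashes in the closing tags.
if ($content =~ m{(<div class="title-bar" .*?</table></div></div>)}s) {
    $summary_section = $1;
}

print defined $summary_section ? "$summary_section\n" : "no match\n";
```

The captured text keeps its embedded newline, so it can be written to a separate file unchanged. This still relies on the closing `</table></div></div>` sequence staying exactly as it is in the report, which is the fragility the parser approach avoids.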

answered Feb 23 '16 at 13:21 – bolav

I'm not sure of the wisdom of suggesting `[\S\s]`, which is really a JavaScript hack. The Perl way is to use `/s` as you described, and Python users will want to enable the `DOTALL` flag – Borodin Feb 23 '16 at 13:58
I agree, I just wanted a solution that didn't use modifiers as well. I will delete that solution. – bolav Feb 23 '16 at 14:04
You can use `(?s)` inside the pattern if it's portability you're after? – Borodin Feb 23 '16 at 14:14
