I am using Perl to connect to a site, parse its HTML, and extract the innerHTML between tags. I am starting with the simple case before moving on to more advanced ones.

I use LWP::UserAgent to craft my HTTP GET Request to the site and receive my response.

I store the response in an array as follows:

@res = ($ua->request($req))->content;

Edit: HTML to be parsed:

<div class="new"> this is Line 1 </div>
<div>
      this is Line 2 </div>

Now, I parse each line in the HTTP Response and extract the text between the tags:

foreach $line(@res)
{
chomp $line;
if($line =~ /<div[^>]*?>(.*)<\/div>/)
{
    $match = $1;
    print OUTPUT $match."\n";
}
}

The problems with the above code snippet are:

  1. It prints the innerHTML only for the first successful match, not for all of them. I am not sure why; as far as I can tell, the loop should work. The variable $match should be overwritten with the contents of the capture buffer after every successful match.

  2. It cannot extract the innerHTML if the element spans multiple lines, for example with the opening div tag on the first line, the innerHTML on the next line, and the closing div tag on the line after that.

I am unable to write the HTML in this post, so have given the description.

Any help would be appreciated.

Neon Flash

3 Answers

Using a robust HTML parser:

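use strict;
use warnings;
use HTML::TreeBuilder::XPath qw();

# Build a tree from the response body.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($http_response->content);
$tree->eof;    # signal end of input so the tree is finalized

# Select every div, then print each child: elements as HTML, plain text as-is.
for my $node ($tree->findnodes('//div')) {
    print ref($_) ? $_->as_HTML : $_ for $node->content_list;
}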
daxim
  • Thanks. I understand that an HTML parser eases the work and makes the coding more efficient. However, I wanted to write a script that does not require the user to install extra packages from CPAN. It would be helpful if you could explain a bit more how this code extracts the innerHTML between the div tags. – Neon Flash Jun 26 '12 at 16:55
  • Why do you want to write a script that does not require the user to install extra packages from CPAN? I forgot to link to the documentation: [parse](http://p3rl.org/HTML::Parser#p-parse-string-), [findnodes](http://p3rl.org/HTML::TreeBuilder::XPath#findnodes-path-), [as_HTML](http://p3rl.org/HTML::Element#as_HTML), [content_array_ref](http://p3rl.org/HTML::Element#content_array_ref) – daxim Jun 26 '12 at 17:15
You should use progressive matching (the /g modifier in a loop) to extract all matches from a line. For example, if $line holds the string <div>This is a div, followed by a </div><span>span</span>, and you want to extract "This is a div, followed by a " and "span", you can use something like this:

print "$2\n" while $line =~ /<(.*?)>(.*?)<\/\g{1}>/g;

Of course, if you want to parse nested elements as well, it's going to be a whole lot more difficult and tricky. As for your second problem, you need . to match across line breaks. The best way is the /s modifier, which makes . match a newline as well (multi-line mode, /m, only changes how ^ and $ behave). Alternatively, you can merge all the lines into a single string by reading the whole filehandle into a scalar variable.
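
A minimal sketch of both points, run against the sample HTML from the question. The variable names and the inline heredoc are assumptions for illustration. The pattern is a slight adaptation of the one above: \w+ captures only the tag name and [^>]* skips attributes, so that <div class="new"> still pairs with </div>.

```perl
use strict;
use warnings;

# Sample input from the question, held in one scalar (not split into lines).
my $html = <<'HTML';
<div class="new"> this is Line 1 </div>
<div>
      this is Line 2 </div>
HTML

my @matches;
# /g walks through every match in turn; /s lets . cross line breaks.
while ($html =~ /<(\w+)[^>]*>(.*?)<\/\g{1}>/gs) {
    my $inner = $2;
    $inner =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    push @matches, $inner;
}

print "$_\n" for @matches;
```

This prints "this is Line 1" and "this is Line 2" on separate lines. It still does not cope with nested elements of the same name, which is where a real parser becomes the safer choice.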

SexyBeast
  • Thanks. Could you please explain the regular expression you used in more detail? I understood that the innerHTML will be stored in the second capture buffer which you are printing. However, when I replaced my regular expression with yours in the code, it gives a blank file. Could you elaborate with an example? That would really help. You can use my code as an example and modify it with your regular expression. – Neon Flash Jun 26 '12 at 16:46
  • Sorry, I blindly copied the text from somewhere and wrote the regex without seeing it properly! Is it clear now? If not, I will explain. – SexyBeast Jun 26 '12 at 17:21
  • Yeah, thanks. I have edited my post and added the HTML code which I am trying to parse. I have mentioned two div tags and want to extract the innerHTML in between them. Could you please write your regular expression to extract them? That way, I can also test. At present, your regular expression is written to extract the content between all the tags from what I understand. I will try it out soon. – Neon Flash Jun 27 '12 at 09:34
  • Ummm, isn't innerHTML the content between two tags? Like if the element is **<div>Hi, I am Cupid</div>**, won't the innerHTML be **Hi, I am Cupid**? – SexyBeast Jun 27 '12 at 10:28
  • So is your problem the fact that the tags have attributes? Otherwise my code will extract the string between both divs...If you want I can provide the code which takes care of tags with attributes as well... – SexyBeast Jun 27 '12 at 10:45
  • Yes, that is the innerHTML. However, please look at the HTML I have posted. Your code does not answer that. It just extracts all the innerHTML between all the tags. I want it to be specific to div tags only! – Neon Flash Jun 27 '12 at 15:15
  • Oh then, just use this regex: print "$+{name}\n" while $line =~ /(
    ["']?)\w+\k)*))>(?P.*?)<\/\div>/g;. In fact you could use the previous version also. it was flexible, allowing you to extract other tags' content also. You could have filtered out the div's by something like **if (index($1,"
    -1)**. This one does the same, while extracting the content between divs only, plus taking care of any number of attributes as well.
    – SexyBeast Jun 27 '12 at 16:04
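
For reference, a hedged sketch of a div-only pattern with a named capture, in the spirit of the comment above. It is not the commenter's exact regex; the sample string, the extra <p> element, and the capture name are assumptions for illustration, and it assumes no > appears inside an attribute value.

```perl
use strict;
use warnings;

# Sample from the question plus one non-div element that should be ignored.
my $html = qq{<div class="new"> this is Line 1 </div>\n}
         . qq{<p>skipped</p>\n}
         . qq{<div>\n      this is Line 2 </div>\n};

my @inner;
# <div\b       : literal tag name, so <p>, <span>, ... never match
# [^>]*        : any number of attributes
# (?<name>...) : named capture, read back through the %+ hash
while ($html =~ /<div\b[^>]*>(?<name>.*?)<\/div>/gis) {
    push @inner, $+{name};
}

print "[$_]\n" for @inner;
```

Named captures and %+ need Perl 5.10 or later. Nested divs are still out of reach for this pattern.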
If you want to make it generic enough and suitable for real-world application, it's a bit more complicated.

First, you probably want to get rid of the content between <script> and </script> tags.

Second, you cannot assume that the opening tag always contains the same text as the closing one, e.g. <span class="myclass"> is not quite the same as </span>.

I would suggest stripping every <something> tag, whatever kind of tag it is, and removing <script> elements together with their content.

You probably can't get away with one super-smart regexp; it is easier to use several regexps for the job.

Here is a little script I've put together; it works fine on cnn.com (as a sample of non-trivial input). I tried to preserve the line breaks, just to print it nicely, and removed the empty lines, but obviously all this might not be necessary.

I also used a dirty trick here: each \n is hidden behind a dummy \\\\NN marker, because without the /s modifier . does not match a newline, so a <script> block spanning several lines would not be removed otherwise.

    my $text = "";
    foreach my $line (@res)
    {
        chomp $line;
        $text .= $line . "\\\\NN"; # Hiding the \n's
    }

    $text =~ s/(<script(\s[^<]*)?>.*?<\/script>)//gi;
    $text =~ s/<.*?>/ /g;

    # Beautify it... :)
    $text =~ s/\s{2,}/ /g;
    $text =~ s/\s*\\\\NN\s*/\\\\NN/g;
    $text =~ s/(\\\\NN){2,}/\\\\NN/g;
    $text =~ s/\\\\NN/\n/g;

    print $text."\n";
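
The same cleanup can probably be done without the placeholder by letting . match newlines through the /s modifier. A minimal sketch, assuming the whole page is held in a single string; strip_tags and the sample input are hypothetical names chosen for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: strip markup from an HTML string and tidy whitespace.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<script\b[^>]*>.*?<\/script>//gis;    # drop scripts with their content
    $html =~ s/<[^>]*>/ /g;                           # replace each remaining tag with a space
    $html =~ s/[^\S\n]+/ /g;                          # collapse whitespace runs, keeping newlines
    $html =~ s/ *\n */\n/g;                           # trim spaces around line breaks
    $html =~ s/\n{2,}/\n/g;                           # squeeze empty lines
    $html =~ s/^\s+|\s+$//g;                          # trim both ends
    return $html;
}

my $page = qq{<div class="new"> this is Line 1 </div>\n}
         . qq{<script>\nvar x = 1;\n</script>\n}
         . qq{<div>\n      this is Line 2 </div>\n};

my $text = strip_tags($page);
print "$text\n";
```

On this sample the output is the two text lines with the script body gone. Like the script above, it treats HTML as plain text, so comments, CDATA sections, or a > inside an attribute value can still trip it up.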
Peter Al