I am using Perl to connect to a site, parse its HTML, and extract the innerHTML between the tags. I want to get this simple case working before moving on to anything more advanced.
I use LWP::UserAgent to build an HTTP GET request to the site and receive the response.
I store the response in an array as follows:
@res = ($ua->request($req))->content;
Edit: HTML to be parsed:
<div class="new"> this is Line 1 </div>
<div>
this is Line 2 </div>
Next, I go through each line of the HTTP response and extract the text between the tags:
foreach $line (@res)
{
    chomp $line;
    if ($line =~ /<div[^>]*?>(.*)<\/div>/)
    {
        $match = $1;
        print OUTPUT $match."\n";
    }
}
The problems with the above code snippet are:
1. It prints the innerHTML of only the first successful match, not of every match. I am not sure why; as far as I can tell the loop should work, since the variable $match should be overwritten with the contents of the capture buffer after every successful match.
2. It cannot extract the innerHTML when a tag spans multiple lines, for example when the opening div tag is on one line, the innerHTML is on the next line, and the closing div tag is on the line after that.
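For reference, here is a minimal sketch of one way both problems might be approached. It assumes the response body is held in a single string, which is what the content method of an HTTP::Response returns (so @res above would contain just one element holding the whole page). The helper name extract_div_text is hypothetical and not part of LWP, and the sample string stands in for the real response.

```perl
use strict;
use warnings;

# Sample markup from the question, kept as ONE string (assumption: the
# response body arrives as a single scalar, not as a list of lines).
my $html = <<'HTML';
<div class="new"> this is Line 1 </div>
<div>
this is Line 2 </div>
HTML

# Hypothetical helper: returns the trimmed text of every <div> it finds.
sub extract_div_text {
    my ($content) = @_;
    my @found;
    # /g repeats the match through the whole string, /s lets "." match
    # newlines, and the non-greedy .*? stops at the nearest </div>.
    while ($content =~ /<div[^>]*>(.*?)<\/div>/gs) {
        my $text = $1;
        $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
        push @found, $text;
    }
    return @found;
}

print "$_\n" for extract_div_text($html);
```

This pattern does not handle nested div elements, so for real pages a dedicated HTML parser such as HTML::TreeBuilder is likely to be more robust than a regular expression.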
I am unable to write the HTML in this post, so I have given a description instead.
Any help would be appreciated.