I've been working on extracting info from Metacritic, but I've run into a problem: I can't cleanly extract text that contains apostrophes or dashes.
This problem is illustrated in the following code:
use strict;
use warnings;
use WWW::Mechanize;

my $reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews';
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.';
my $l = WWW::Mechanize->new();
$l->get($reviewspage);
my $k = $l->content;
# \Q...\E keeps the "." and "," in $Review from being treated as regex metacharacters
my @Review = $k =~ m{\Q$Review\E.*?<div class="review_body">(.*?)</div>}s;
print "@Review\n";
Outputs:
Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
Even though the HTML source on the website is:
<div class="review_body">
Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
</div>
I've written similar scripts with WWW::Mechanize before, and none of them substituted characters like this.
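One way to check whether this is an output-encoding issue rather than something WWW::Mechanize itself does is the small sketch below, which needs no network access. The string is a hypothetical stand-in for the decoded page text, and the variable names and sample text are assumptions, not taken from the script above. If the curly apostrophe is one character in Perl but several bytes once encoded as UTF-8, then printing without an encoding layer on STDOUT could plausibly explain the garbled apostrophes and dashes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for decoded page text; \x{2019} is the curly apostrophe.
my $text = "won\x{2019}t";

# Encode explicitly to see how many bytes the same text takes as UTF-8.
my $bytes = encode('UTF-8', $text);

# Tell STDOUT to encode characters as UTF-8 instead of printing them raw.
binmode(STDOUT, ':encoding(UTF-8)');
printf "%d characters, %d bytes as UTF-8\n", length($text), length($bytes);
print "$text\n";
```

If that turns out to be the cause, adding the same `binmode(STDOUT, ':encoding(UTF-8)');` line before the `print` in the original script may be enough, assuming the terminal expects UTF-8.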