I have tried this with my own regular expression to get it working in my project, but after several hours of searching the internet I still can't get it to work. I am now trying the code below (not mine), but it only produces the "die" statement. My own code (in another question on here) only returns "Press any key..". What am I doing wrong?

#!/usr/bin/perl -w
# cookbook-rank - find rank of Perl Cookbook on Amazon

use LWP::Simple;

my $html = get("http://www.amazon.com/exec/obidos/ASIN/1565922433")
  or die "Couldn't fetch the Perl Cookbook's page.";
$html =~ m{Amazon\.com Sales Rank: </b> ([\d,]+) </font><br>} || die;
my $sales_rank = $1;
$sales_rank =~ tr[,][]d;    # 4,070 becomes 4070
print "$sales_rank\n";
pierrefelipe
  • What exact error message do you get when you run this script? – Dre Feb 14 '15 at 17:31
  • http://puu.sh/fUBcn/56dd545dca.png – pierrefelipe Feb 14 '15 at 17:36
  • Please copy/paste the error message as text instead. Picture links are annoying. – tripleee Feb 14 '15 at 17:43
  • Parsing HTML with regular expressions is a losing game. A small change in formatting can break your code, which is what happened here. What you really want is an HTML parser and to use XPath to find the elements you want by their ID (here it's #SalesRank). That's [another question which has already been answered](http://stackoverflow.com/a/4598384/14660). Better yet, rather than scraping the page, which is slow and prone to change, you should use an API if available. – Schwern Feb 14 '15 at 18:04

2 Answers

The die happens when the downloaded content does not contain any text which matches the regex. There's nothing wrong with LWP or with the code itself, other than the assumption that the download will match. (The die statement had better contain an explanation of what went wrong, though.)
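
As a rough illustration of that last point, here is a minimal sketch in which the `die` says what failed to match. The helper name `extract_rank` and both sample strings are made up for this example; they are not real Amazon markup, and no network access is involved.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the rank as a plain number, or dies with a
# message that says which pattern was being looked for.
sub extract_rank {
    my ($html) = @_;
    $html =~ m{Amazon\.com Sales Rank: </b> ([\d,]+) </font><br>}
        or die "Page fetched, but the 'Amazon.com Sales Rank' pattern did not match\n";
    my $rank = $1;
    $rank =~ tr/,//d;    # 4,070 becomes 4070
    return $rank;
}

# Made-up sample in the old layout that the regex expects.
my $old_layout = 'Amazon.com Sales Rank: </b> 4,070 </font><br>';
print extract_rank($old_layout), "\n";

# Made-up sample in a different layout: the regex no longer matches,
# so extract_rank dies and eval catches the message in $@.
my $new_layout = '<b>Amazon Best Sellers Rank:</b> #4,070 in Books';
my $rank = eval { extract_rank($new_layout) };
print "Failed: $@" unless defined $rank;
```

With a message like this, a failed match is immediately distinguishable from a failed download, which the bare `die` in the question does not make clear.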

Sinan Ünür
tripleee
  • @SinanÜnür Could you please offer a rationale for your edit? I wasn't very happy with the wording of that passage but I'm not sure removing it entirely is the right solution. – tripleee Feb 15 '15 at 07:04
  • Just my mobile phone's browser. I guess it doesn't support this particular redirect. Thanks for the explanation. – tripleee Feb 15 '15 at 09:34

Looks like the Amazon HTML has changed since that example was written. The page no longer contains the string "Amazon.com Sales Rank". Instead, it now says "Amazon Best Sellers Rank:".

But you'll need to look at the HTML source for the page. For some reason, Amazon insert over thirty blank lines between that label and the line containing the actual sales rank.

Which is, all in all, a nice example of why screen-scraping is a bad idea. You'd be much better advised to use Amazon's product API.
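
If scraping is unavoidable, the blank lines after the label can be absorbed by `\s*`, which also matches newlines. The following is only a sketch under assumptions: the helper name `find_rank` is hypothetical, and the sample fragment (label, thirty newlines, then `#4,070 in Books`) is invented for illustration. The real page markup may differ and can change at any time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the rank with commas removed,
# or undef if the label and number are not found.
sub find_rank {
    my ($html) = @_;
    # \s* matches newlines too, so runs of blank lines are skipped.
    # (?:<[^>]+>\s*)* skips any tags sitting between the label and the number.
    return unless $html =~ m{Amazon Best Sellers Rank:\s*(?:<[^>]+>\s*)*\#?([\d,]+)};
    my $rank = $1;
    $rank =~ tr/,//d;    # 4,070 becomes 4070
    return $rank;
}

# Made-up fragment: label, a run of blank lines, then the rank.
my $html = "<b>Amazon Best Sellers Rank:</b>" . ("\n" x 30) . "#4,070 in Books\n";

my $sales_rank = find_rank($html);
defined $sales_rank
    or die "Could not find 'Amazon Best Sellers Rank:' followed by a number\n";
print "$sales_rank\n";
```

This is still fragile in exactly the way described above: any further change to the page layout will break the pattern again.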

Dave Cross