Has anyone run the Perl script given at http://oreilly.com/pub/h/974#code ?

It is a well-known script for getting URLs from the Yahoo! directory, and many people have used it successfully.

I was trying to get URLs with it. I created my own Google API key and put it in the code; apart from that, I made no changes.

The script produces neither an error nor any URLs.

#!/usr/bin/perl -w

use strict;
use LWP::Simple;
use HTML::LinkExtor;
use SOAP::Lite;

my $google_key  = "your API key goes here";
my $google_wdsl = "GoogleSearch.wsdl";
my $yahoo_dir   = shift || "/Computers_and_Internet/Data_Formats/XML_  _".
              "eXtensible_Markup_Language_/RSS/News_Aggregators/";

# download the Yahoo! directory.
my $data = get("http://dir.yahoo.com" . $yahoo_dir) or die $!;

# create our Google object.
my $google_search = SOAP::Lite->service("file:$google_wdsl");
my %urls; # where we keep our counts and titles.

# extract all the links and parse 'em.
HTML::LinkExtor->new(\&mindshare)->parse($data);

sub mindshare { # for each link we find...
  my ($tag, %attr) = @_;
  print "$tag\n";

  # continue on only if the tag was a link,
  # and the URL matches Yahoo!'s redirectory.
  return if $tag ne 'a';
  return unless $attr{href} =~ /srd.yahoo/;
  return unless $attr{href} =~ /\*http/;

  # now get our real URL.
  $attr{href} =~ /\*(http.*)/; my $url = $1;
  print "hi";

  # and process each URL through Google.
  my $results = $google_search->doGoogleSearch(
                      $google_key,"link:$url", 0, 1,
                      "true", "", "false", "", "", ""
                ); # wheee, that was easy, guvner.
  $urls{$url} = $results->{estimatedTotalResultsCount};
  print "1\n";
}

# now sort and display.
my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;
foreach my $url (@sorted_urls) { print "$urls{$url}: $url\n"; }

The program enters the loop, but exits on the first iteration and jumps to "my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;".

I don't know Perl, but this task should have been trivial.

Surely I am missing something very obvious, since many people have used this script successfully.

Thanks in advance.

Kara
instanceOfObject

1 Answer

Are you supplying a directory to the script? Because if you are not, and this line in your script

"/Computers_and_Internet/Data_Formats/XML_  _".
              "eXtensible_Markup_Language_/RSS/News_Aggregators/"

is not a formatting artefact, then you're trying to scrape a non-existent page.
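One quick way to check this point is to build the default path exactly as the script does and look for whitespace in it. The following is only a minimal sketch using core Perl; has_whitespace is a hypothetical helper, not part of the original script, and the check says nothing about whether Yahoo! actually serves the page.

```perl
use strict;
use warnings;

# Default directory exactly as it appears in the question
# (two string pieces joined with ".").
my $yahoo_dir = "/Computers_and_Internet/Data_Formats/XML_  _" .
                "eXtensible_Markup_Language_/RSS/News_Aggregators/";

# Hypothetical helper: returns 1 if the path contains any whitespace.
sub has_whitespace {
    my ($path) = @_;
    return $path =~ /\s/ ? 1 : 0;
}

# The URL the script would request when no argument is given.
my $url = "http://dir.yahoo.com" . $yahoo_dir;
print "URL: $url\n";
print "contains whitespace: ",
      (has_whitespace($yahoo_dir) ? "yes" : "no"), "\n";
```

If the check reports whitespace, the two spaces are really in the string and not just a display artefact, so passing a directory explicitly on the command line is the safer test.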

Alien Life Form
  • Hey, even running it as perl mindshare.pl "/Entertainment/Humor/Procrastination/" doesn't help. GoogleSearch.wsdl is present in the same directory. Is there anything else I need to do? – instanceOfObject Feb 08 '12 at 12:34
  • I even tried it without the Google API key, modifying the function definition to receive fewer arguments. – instanceOfObject Feb 08 '12 at 12:43
  • I don't have a Google API key or the time to test the script right now, but run the script as perl -d mindshare.pl; it will drop you into the Perl debugger. Single-step through the procedure; 'p' and 'x' will help you see what's in the variables (and 'l' will list your location). From what you say, the associative array %urls is not being filled. – Alien Life Form Feb 08 '12 at 14:16
  • Yes, you are correct, %urls is not being filled; even $google_wdsl and $yahoo_dir are not getting printed. I am getting an uninitialized value warning: "Use of uninitialized value in print at (eval 98)[/usr/lib/perl5/5.8.8/perl5db.pl:628] line 2." – instanceOfObject Feb 09 '12 at 05:51
  • Hey, none of the URLs get past these lines: return unless $attr{href} =~ /srd.yahoo/; return unless $attr{href} =~ /\*http/; Any thoughts? – instanceOfObject Feb 10 '12 at 11:30
  • Put print STDOUT $data; after the my $data line. What do you see? – Alien Life Form Feb 10 '12 at 16:03
  • Hey, does this script really scrape all the URLs belonging to a directory? It seems that it only scrapes URLs present at depth 1. Correct me if I am wrong. – instanceOfObject Feb 11 '12 at 16:49
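
To see why no link gets past the two return unless lines, the filter can be tried in isolation on a few sample hrefs. This is only a sketch in core Perl: the hrefs are made-up examples (not real Yahoo! output), real_url is a hypothetical helper, and the dot in the pattern is escaped here, which the original script does not do.

```perl
use strict;
use warnings;

# Made-up sample hrefs; only the first has the redirector form
# that the script's filters expect.
my @hrefs = (
    "http://srd.yahoo.com/drst/12345/*http://www.example.com/",
    "http://www.example.com/",
    "/Computers_and_Internet/",
);

# Hypothetical helper: returns the target URL embedded after "*"
# in a redirector link, or undef if the href is not such a link.
sub real_url {
    my ($href) = @_;
    return unless defined $href;
    return unless $href =~ /srd\.yahoo/;
    return $href =~ /\*(http.*)/ ? $1 : undef;
}

for my $href (@hrefs) {
    my $url = real_url($href);
    printf "%-60s => %s\n", $href, defined $url ? $url : "(skipped)";
}
```

If the hrefs in the downloaded page look like the second or third sample, every link is skipped and %urls stays empty, which would match the behaviour described in the question. Printing $data, as suggested above, shows which form the page really uses.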