11

I'm trying to fetch Wikipedia pages using LWP::Simple, but they're not coming back. This code:

#!/usr/bin/perl
use strict;
use LWP::Simple;

print get("http://en.wikipedia.org/wiki/Stack_overflow");

doesn't print anything. But if I use some other webpage, say http://www.google.com, it works fine.

Is there some other name that I should be using to refer to Wikipedia pages?

What could be going on here?

brian d foy
Jesse Beder

5 Answers

18

Apparently Wikipedia blocks LWP::Simple requests: http://www.perlmonks.org/?node_id=695886

The following works instead:

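#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $url = "http://en.wikipedia.org/wiki/Stack_overflow";

# LWP::UserAgent identifies itself as libwww-perl/x.y by default,
# rather than with LWP::Simple's agent string.
my $ua  = LWP::UserAgent->new();
my $res = $ua->get($url);

# Report the HTTP status instead of silently printing nothing.
die "Request failed: ", $res->status_line, "\n" unless $res->is_success;

print $res->content;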
Jesse Beder
  • I get **500 Can't connect to en.wikipedia.org:443** for the given wiki URL, whereas the Stack Overflow home page, http://stackoverflow.com, returns 403. Adding `$ua->agent("WikiBot/0.1");` before calling `get` fixed the 403 for many sites, including Stack Overflow, but the wiki page still fails with the same 500 error. – Kamal Nayan Apr 18 '16 at 07:12
  • Constructing the agent with `$ua = LWP::UserAgent->new(ssl_opts => { verify_hostname => 0 });` fixed the **500 Can't connect to en.wikipedia.org:443** error. It may help someone else. – Kamal Nayan Apr 18 '16 at 08:08
11

You can also just set the UA on the LWP::Simple module - just import the $ua variable, and it'll allow you to modify the underlying UserAgent:

use LWP::Simple qw/get $ua/;
$ua->agent("WikiBot/0.1");
print get("http://en.wikipedia.org/wiki/Stack_overflow");
zigdon
6

I solved this problem by using LWP::RobotUA instead of LWP::UserAgent. The documentation is linked below; very little of your code needs to change.

http://lwp.interglacial.com/ch12_02.htm
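As a rough sketch of what the switch might look like: LWP::RobotUA is a subclass of LWP::UserAgent that consults robots.txt and pauses between requests to the same host. The agent name `WikiBot/0.1` is borrowed from elsewhere in this thread, and the `from` address is a placeholder; both are assumptions to replace with your own values. The script takes the URL as a command-line argument so that nothing is fetched unless you ask for it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA requires both an agent name and a contact address.
# The values below are placeholders, not anything Wikipedia mandates.
my $ua = LWP::RobotUA->new(
    agent => 'WikiBot/0.1',
    from  => 'me@example.com',
);

# delay() is measured in minutes; 1/60 asks for about one second
# between requests to the same host (the default is a full minute).
$ua->delay(1/60);

# Only touch the network when a URL is supplied on the command line.
if (my $url = shift @ARGV) {
    my $res = $ua->get($url);
    die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;
    print $res->content;
}
```

Run it as `perl robot_fetch.pl http://en.wikipedia.org/wiki/Stack_overflow` (the file name is arbitrary). Because the robot agent fetches robots.txt before the page itself, the first request to a host takes slightly longer.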

Samed Konak
5

Because Wikipedia is blocking the HTTP user-agent string used by LWP::Simple.

You will get a "403 Forbidden"-response if you try using it.

Try the LWP::UserAgent module to work around this, setting the agent-attribute.
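A minimal sketch of that workaround, assuming a made-up agent string (`WikiBot/0.1`, as used elsewhere in this thread) and two helper names, `make_ua` and `fetch`, that are purely illustrative. It also reports the HTTP status line on failure, so a 403 shows up as an error rather than as empty output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent with an explicit agent string.
# make_ua is a hypothetical helper name used only for this example.
sub make_ua {
    my ($agent) = @_;
    my $ua = LWP::UserAgent->new;
    $ua->agent($agent);
    return $ua;
}

# Fetch a URL and return the body, or die with the HTTP status line
# (for example "403 Forbidden") so the failure is visible.
sub fetch {
    my ($ua, $url) = @_;
    my $res = $ua->get($url);
    die "GET $url failed: " . $res->status_line . "\n" unless $res->is_success;
    return $res->content;
}

# Only touch the network when a URL is supplied on the command line.
if (my $url = shift @ARGV) {
    print fetch(make_ua('WikiBot/0.1'), $url);
}
```

Invoke it with the page to fetch as the only argument, for example `perl fetch.pl http://en.wikipedia.org/wiki/Stack_overflow`. If the site redirects to HTTPS, LWP will probably also need the LWP::Protocol::https module installed.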

SHODAN
5

Also see the MediaWiki-related CPAN modules. These are designed to talk to MediaWiki sites (of which Wikipedia is one) and may give you more bells and whistles than plain LWP.

http://cpan.uwinnipeg.ca/search?query=Mediawiki&mode=dist

Jonathan Swartz