use strict;

use LWP::UserAgent;

my $UserAgent = LWP::UserAgent->new;

my $response = $UserAgent->get("https://scholar.google.co.in/scholar_lookup?author=N.+R.+Alpert&author=S.+A.+Mohiddin&author=D.+Tripodi&author=J.+Jacobson-Hatzell&author=K.+Vaughn-Whitley&author=C.+Brosseau+&publication_year=2005&title=Molecular+and+phenotypic+effects+of+heterozygous,+homozygous,+and+compound+heterozygote+myosin+heavy-chain+mutations&journal=Am.+J.+Physiol.+Heart+Circ.+Physiol.&volume=288&pages=H1097-H1102");

if ($response->is_success) {
    $response->content =~ /<title>(.*?) - Google Scholar<\/title>/;
    print $1;
}
else {
    die $response->status_line;
}

I get the error below when I run this script.

403 Forbidden at D:\Getelement.pl line 52.

When I paste this URL into the browser's address bar it loads the expected page, but the same request fails when it is sent from the script.

Can you please help me with this issue?

Mat
Siva

3 Answers

Google's Terms of Service disallow automated searches. Google can tell that the request comes from a script because the headers your script sends differ considerably from the headers a browser normally sends; you can compare the two yourself if you want.

In the past Google offered a SOAP API, and you could use modules such as WWW::Search::Google, but that is no longer an option because the API was deprecated.
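To see what the script actually sends, the request can be built and printed without contacting any server. This is a minimal sketch: the URL is a placeholder, not the Scholar address from the question, and the exact default agent string depends on the installed libwww-perl version.

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;

# Default identification string; it starts with "libwww-perl/"
print "Default agent: ", $ua->agent, "\n";

# Build the request without sending it (placeholder URL, an assumption)
my $request = HTTP::Request->new(GET => 'http://example.com/');

# Apply the user agent's default headers, including User-Agent
$ua->prepare_request($request);

# Print the request line and headers as they would go over the wire
print $request->as_string;

Comparing this output with the request headers shown in a browser's developer tools makes the difference visible.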

Alternatives were already discussed in the following StackOverflow question:

Community
sidyll

Google appears to block the default LWP::UserAgent request. It rejects either the default User-Agent string or some other part of the request, such as the headers.

I suggest you use Mojo::UserAgent. By default its request looks more like one from a browser, and a single line of code is enough.

use Mojo::UserAgent;
use strict;
use warnings;

print Mojo::UserAgent->new->get('https://scholar.google.co.in/scholar_lookup?author=N.+R.+Alpert&author=S.+A.+Mohiddin&author=D.+Tripodi&author=J.+Jacobson-Hatzell&author=K.+Vaughn-Whitley&author=C.+Brosseau+&publication_year=2005&title=Molecular+and+phenotypic+effects+of+heterozygous,+homozygous,+and+compound+heterozygote+myosin+heavy-chain+mutations&journal=Am.+J.+Physiol.+Heart+Circ.+Physiol.&volume=288&pages=H1097-H1102')->res->dom->at('title')->text;

# Prints Molecular and phenotypic effects of heterozygous, homozygous, and      
# compound heterozygote myosin heavy-chain mutations - Google Scholar

Terms

The code does not accept any terms, and no additional lines have been added to bypass security checks. It's absolutely fine.

user3606329
  • Thanks for your reply. I would like to get the data inside the <h3 class="gs_rt"> tag instead of <title>. I used the code "my $response = $UserAgent->get($xtx1)->res->dom->at('<h3 class="gs_rt">')->text;" but it gives an error. Could you please suggest a fix? – Siva Jan 06 '17 at 05:22
  • @Siva try $UserAgent->get($xtx1)->res->dom->at('h3.gs_rt')->text; PS: Don't forget to accept my answer, if my example suits your needs. :-) – user3606329 Jan 06 '17 at 07:23
  • Hi, it's not working; it prints a blank line. – Siva Jan 06 '17 at 10:40
  • Then you gave me the wrong DOM. The HTML must look like <h3 class="gs_rt">Text123</h3>. If there are more elements, such as a span or an a with an href, before the text, they need to be specified in the selector. – user3606329 Jan 06 '17 at 10:47
  • @Siva ->res->dom->at('h3.gs_rt a')->text; It prints Molecular and phenotypic effects of heterozygous, homozygous, and compound heterozygote myosin heavy-chain mutations – user3606329 Jan 06 '17 at 10:48
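
The blank line discussed in the comments above can be reproduced offline with Mojo::DOM. This is a sketch on a hypothetical HTML fragment, not markup copied from Google Scholar. In recent Mojolicious releases, text returns only the text directly inside the selected element, while all_text also includes its descendants.

use strict;
use warnings;
use Mojo::DOM;

# Hypothetical fragment: the title text sits inside a nested <a>
my $html = '<h3 class="gs_rt"><a href="/paper">Example title</a></h3>';
my $dom  = Mojo::DOM->new($html);

my $direct = $dom->at('h3.gs_rt')->text;      # text directly inside <h3> only
my $link   = $dom->at('h3.gs_rt a')->text;    # text inside the nested <a>
my $all    = $dom->at('h3.gs_rt')->all_text;  # text of <h3> and its descendants

print "direct: '$direct'\n";
print "link:   '$link'\n";
print "all:    '$all'\n";

Since the <h3> here has no text of its own, the first call yields an empty string, which matches the blank output reported.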

You can fetch your content if you add a User Agent string to identify yourself to the web server:

...
my $UserAgent = LWP::UserAgent->new;
$UserAgent->agent('Mozilla/5.0'); #...add this...
...
print $1;
...

This prints: "Molecular and phenotypic effects of heterozygous, homozygous, and compound heterozygote myosin heavy-chain mutations"
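
Put together, the change might look like the sketch below. The extract_title helper and the command-line URL argument are illustrative assumptions, not part of the original script; the regular expression is the one from the question. Unlike the original, it checks that the pattern matched before printing.

use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: returns the captured title, or undef if no match
sub extract_title {
    my ($html) = @_;
    return $html =~ m{<title>(.*?) - Google Scholar</title>}s ? $1 : undef;
}

# Fetch only when a URL is passed on the command line
if (@ARGV) {
    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/5.0');    # replace the default libwww-perl agent string

    my $response = $ua->get($ARGV[0]);
    die $response->status_line unless $response->is_success;

    my $title = extract_title($response->decoded_content // '');
    die "Title not found\n" unless defined $title;
    print "$title\n";
}

Whether the server accepts the request can change over time, so the 403 may return even with a browser-like agent string.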

JRFerguson