23

I would like to extract the first 100 results (say) of a Google Scholar search using R. Does anyone know how to do it?

To be precise, I just need the name of the paper, authors and citation count.

PS: Would this be legal?

John Conde
Manoel Galdino
  • 2
    It looks like Google scholar is lacking a [nice API](http://code.google.com/p/google-ajax-apis/issues/detail?id=109&colspec=ID%20Type%20Stars%20Status%20Modified%20Summary%20APIType%20Opened) – csgillespie Feb 15 '11 at 16:54
  • 1
    Re your PS: I have looked at the "about" page (http://scholar.google.ca/intl/en/scholar/about.html) and don't see any explicit terms of use – Ben Bolker Feb 15 '11 at 19:34
  • 1
    Also http://tonybreyal.wordpress.com/2011/11/08/web-scraping-google-scholar-partial-success/ – Ben Bolker Nov 09 '11 at 15:41
  • 1
    And the update: http://tonybreyal.wordpress.com/2011/11/08/web-scraping-google-scholar-part-2-complete-success/ – Ben Bolker Nov 09 '11 at 21:45
  • Not a strict answer, but I'd suggest learning Python for web scraping tasks. Even if you don't plan on using it for statistical programming, it's just a lot easier for scraping in my experience and has more references you can use. I spent the time to learn it on top of R, and definitely don't think that was time wasted. – verybadatthis Jun 28 '16 at 21:00
  • did anyone ever find an answer for this? – stats_noob Jul 21 '23 at 17:53

5 Answers

6

Please consider the updated BioBucket post:

http://thebiobucket.blogspot.com/2011/11/r-function-google-scholar-webscraper.html

Kay
  • Sorry, the script on theBioBucket is outdated due to changes on Google Scholar. No idea when I'll get a chance to fix it. – Kay Jun 28 '12 at 09:41
4

There are some Python and Perl scrapers out there that you might be able to adapt, linked at http://bmb-common.blogspot.com/2011/02/does-google-scholar-suck-or-am-i-just.html

Ben Bolker
3

I can't speak to the legalities of your task, but there are a few ways you can go about this. While I am not strong in XPath, it might be the best way. I believe that you can use the XML package to retrieve the page contents and use XPath to extract the data of the elements you need.

For instance, I use Chrome as my browser, and when I inspected the page with Developer Tools, there does appear to be a structure to the page, with the data "hidden" inside various tags that you should be able to exploit quite easily using XPath.

Check out this link for an example of using XPath.
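
To make the XPath idea concrete, here is a minimal sketch. It is in Python using only the standard library, so it is not the R/XML-package code itself, but the same XPath expressions could be handed to `xpathSApply` from the XML package in R. The markup, the class names (`result`, `title`, `authors`, `cited`) and the function name `extract_results` are all hypothetical; they are not Google Scholar's real page structure, which you would need to look up with Developer Tools. Real pages are also rarely well-formed XML, so a lenient HTML parser would likely be needed in place of the strict one used here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a results page.
# The class names are assumptions, not Google Scholar's actual ones.
SAMPLE = """
<html><body>
  <div class="result">
    <h3 class="title"><a href="#">A first paper</a></h3>
    <div class="authors">A Author, B Author</div>
    <div class="cited">Cited by 12</div>
  </div>
  <div class="result">
    <h3 class="title"><a href="#">A second paper</a></h3>
    <div class="authors">C Author</div>
    <div class="cited">Cited by 3</div>
  </div>
</body></html>
"""


def extract_results(markup):
    """Return one dict per result with title, authors and citation count."""
    root = ET.fromstring(markup)
    rows = []
    # Select every result container, then query relative to each one.
    for node in root.findall(".//div[@class='result']"):
        link = node.find("./h3[@class='title']/a")
        title = "".join(link.itertext()).strip() if link is not None else ""
        authors = node.findtext("./div[@class='authors']", default="").strip()
        cited_text = node.findtext("./div[@class='cited']", default="")
        # Pull the first run of digits out of text such as "Cited by 12".
        match = re.search(r"\d+", cited_text)
        rows.append({
            "title": title,
            "authors": authors,
            "citations": int(match.group()) if match else 0,
        })
    return rows


if __name__ == "__main__":
    for row in extract_results(SAMPLE):
        print(row)
```

The point of the sketch is the two-step pattern: one XPath to find each result block, then short relative XPaths for the fields inside it.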

HTH and Good Luck

Community
Btibert3
3

You can definitely retrieve the HTML content of the page using RCurl and parse it using the XML package, as suggested by Btibert3. The only issue you might face is that Google won't allow you to run queries in a "robotic" way. After something like 200 queries in a short period of time, it stops returning results. Maybe that's different with Google Scholar, but I doubt it.
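
One common way to look less "robotic" is simply to space the requests out. The following is a minimal sketch in Python (standard library only), not an RCurl example; the function name `fetch_politely`, its parameters and the 5-second default are my own assumptions, and nothing here guarantees that a site will not still block automated queries.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    fetch is any callable that takes a URL and returns the page content.
    sleep can be swapped out, which makes the pacing easy to test.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages


if __name__ == "__main__":
    # Stand-in fetcher so the example runs without network access.
    demo = fetch_politely(["page-1", "page-2"], fetch=str.upper, delay_seconds=0.1)
    print(demo)
```

In practice `fetch` would wrap whatever HTTP client you use; keeping it as a parameter means the pacing logic stays separate from the download code.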

Jean-Robert
1

A solution was recently published here:

http://thebiobucket.blogspot.com/2011/11/visually-examine-google-scholar-search.html

Tal Galili