I would like to fetch some data from Google Scholar automatically via a MATLAB script. I am mostly interested in Google Scholar's BibTeX entries and the forward citation feature. However, it seems that there is no API for Google Scholar. Is there a way to automatically fetch bibliographic data from Google Scholar using MATLAB? Are there tools or code already available for this?

  • Since there is no API and no structured format either, you'll end up with a lot of duplicates, and there is no good way to extract the data reliably. [Here's the same question](http://stackoverflow.com/questions/6109520/can-anybody-share-a-simple-example-of-using-mma-and-google-to-extract-academic-re) but with Mathematica. Sjoerd C. deVries shows in his answer how it can produce a lot of dubious results. – abcd Sep 23 '11 at 14:10
  • @yoda I am building this tool mostly because I am leading a survey team (and then later for my own use) and this is a nice way to make sure we didn't miss any important papers out there. If there are duplicates it is fine since we will mostly be looking at human sized chunks of data in the end. However, if you know of better approaches than fighting with Google Scholar then I would really like to know about that, too. – Artem Kaznatcheev Sep 23 '11 at 14:24
  • I would suggest trying a publication database that is well known in your field of study, for example IEEE Xplore/SPIRE/WebOfScience/ScienceDirect/CiteSeer, etc. I believe most of them have APIs, but all of them are commercial and have high fees, so if your intent is to develop a low-cost or free tool, these might not be helpful. I think it is still possible with Google Scholar; it just requires a lot more effort due to the lack of structure. Nevertheless, "Papers", an app for Macs, manages to return decent results from Google Scholar, so it is not impossible :) – abcd Sep 23 '11 at 15:00

2 Answers

A word of caution I found while working further on this project.

There is a reason why Google Scholar does not have an API: using bots to collect data from Google Scholar is against the EULA. The basic idea is that any program that interfaces with Google Scholar must not do so in a qualitatively different way than an end user would. In other words, you cannot automatically fetch large amounts of data. Although the script in @JustinPeel's answer does not necessarily violate the terms, putting it in a massive loop would.

Some specific points from this EULA:

You shall not, and shall not allow any third party to: ...

(i) directly or indirectly generate queries, or impressions of or clicks on Results, through any automated, deceptive, fraudulent or other invalid means (including, but not limited to, click spam, robots, macro programs, and Internet agents);

...

(l) "crawl", "spider", index or in any non-transitory manner store or cache information obtained from the Service (including, but not limited to, Results, or any part, copy or derivative thereof);

If you look at the Google Scholar robots.txt then you can also see that no bots of any kind are allowed.
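A robots.txt policy can also be checked programmatically rather than by eye. The following is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, the `my-survey-bot` user agent, and the helper name `is_allowed` are all made up for illustration; they are not a copy of Google Scholar's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# This is NOT Google Scholar's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /scholar
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs and bot name.
    for url in ("https://example.com/scholar?q=test", "https://example.com/about"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "my-survey-bot", url))
```

To check a live site instead, `RobotFileParser.set_url()` followed by `read()` downloads the file before calling `can_fetch()`. Note that robots.txt only expresses the crawler policy; the EULA terms quoted above apply regardless of what it says.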

I have heard from some colleagues that you will get in trouble if you try to circumvent this policy, which can result in your lab losing access to Google Scholar.

Community

If you really want to use MATLAB for this (which I don't really advise), then you can look at various web scraping examples, and there is this code that already gets some info from Google Scholar. Basically, just google 'matlab web scraping' and off you go.

I personally would recommend using Python for this because Python is better for general programming, IMHO. For instance, this guy has already done something similar to what you want with Python. However, if you know MATLAB and don't have any interest in or time for Python, then follow the links in the first paragraph.

sunny
Justin Peel