Using Google Scholar and R, I'd like to find out who is citing a particular paper.

The existing packages (like scholar) are oriented towards H-index analyses: statistics on a researcher.

I want to give a target paper as input. An example URL would be:

https://scholar.google.co.uk/scholar?oi=bibs&hl=en&cites=12939847369066114508

R should then scrape the citation pages for that paper (Google Scholar paginates them) and return an array of the papers that cite the target (up to 500 or more citations). We'd then search for keywords in the titles, tabulate journals and citing authors, etc.
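For concreteness, the two steps could be sketched as follows (in Python rather than R, only to illustrate the shape of the task). The `start` offset, the page size of 10, and the record fields `title`, `journal` and `authors` are assumptions, and the sample records are made up. No fetching or HTML parsing is shown.

```python
from collections import Counter
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.co.uk/scholar"


def citation_page_urls(cites_id, max_results=500, page_size=10):
    """Build one URL per results page, assuming paging via a `start` offset."""
    urls = []
    for start in range(0, max_results, page_size):
        query = {"hl": "en", "cites": cites_id, "start": start}
        urls.append(BASE_URL + "?" + urlencode(query))
    return urls


def titles_matching(papers, keyword):
    """Return the titles that contain the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [p["title"] for p in papers if keyword in p["title"].lower()]


def tabulate(papers, field):
    """Count values of a field; list-valued fields (authors) count per item."""
    counts = Counter()
    for paper in papers:
        value = paper[field]
        counts.update(value if isinstance(value, list) else [value])
    return counts.most_common()


# Made-up records standing in for the scraped citing papers.
papers = [
    {"title": "Structural equation models in ecology",
     "journal": "Ecology", "authors": ["A Smith", "B Jones"]},
    {"title": "A twin study of wellbeing",
     "journal": "Behavior Genetics", "authors": ["B Jones"]},
    {"title": "Equation discovery",
     "journal": "Ecology", "authors": ["C Lee"]},
]

print(citation_page_urls("12939847369066114508", max_results=30))
print(titles_matching(papers, "equation"))
print(tabulate(papers, "journal"))
print(tabulate(papers, "authors"))
```

The missing piece is the step in between: fetching each URL and parsing the results into records like these.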

Any clues as to how to do that? Or is it down to literally scraping each page? (which I can do with copy and paste for one-off operations).

Seems like this should be a generally useful function for things like seeding systematic reviews as well, so someone adding this to a package might well increase their H :-)

tim
  • [This post](http://stackoverflow.com/q/22657548/489704) about Google's ToS is probably relevant. – jbaums Mar 13 '15 at 10:17
  • Maybe you could consider searching for papers on Web of Science instead. You can download search results, up to 500, and then process them in R. For some inspiration: http://www.jameskeirstead.ca/blog/how-to-do-a-quantitative-literature-review-in-r/ – Kvasir EnDevenir Mar 13 '15 at 10:20
  • Thanks @jbaums, you are right. This use would be very low volume, in keeping with existing packages. – tim Mar 13 '15 at 10:22
  • Thanks @KvasirEnDevenir. I would like to stay in the wonderful Google Scholar system so that people without a WoS subscription could use it too. – tim Mar 13 '15 at 10:23
  • @Kay has worked on this stuff in the past and has some code on [his website](http://thebiobucket.blogspot.com.au/search?q=+scholar). I haven't tested it recently. – jbaums Mar 13 '15 at 10:41
  • @tim This can definitely help you: http://simplystatistics.tumblr.com/post/13203811645/an-r-function-to-analyze-your-google-scholar –  Mar 13 '15 at 11:10
  • @nemo that set of functions is like the other packages: oriented to analyzing an author, not the citations of a paper – tim Mar 15 '15 at 10:24
  • This is a very well-commented GitHub script that will scrape details of papers: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/GScholarXScraper/GScholarXScraper.R I think my ideal scraper would use DOIs or similar instead of URLs, but this should definitely give you something to build from. – verybadatthis Jun 28 '16 at 20:56

2 Answers

Although Google offers a number of APIs, there is no Google Scholar API. So, although a web crawler for Google Scholar pages might not be difficult to develop, I do not know to what extent it might be illegal. Check this.

Ulises Rosas-Puchuri

Alternatively, you could use a third-party solution like SerpApi. It's a paid API with a free trial. We handle proxies, solve captchas, and parse all rich structured data for you.

Example Python code (libraries are also available for other languages):

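from serpapi import GoogleSearch

# Ask the Google Scholar engine for papers that cite the target paper.
params = {
  "api_key": "secret_api_key",  # replace with your own key
  "engine": "google_scholar",
  "hl": "en",
  "cites": "12939847369066114508"  # the cites ID from the question's URL
}

search = GoogleSearch(params)
results = search.get_dict()  # parsed JSON response as a Python dict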
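Since the citing papers span several result pages, one request per page is needed. Here is a minimal sketch that only builds the per-page parameter dictionaries. The `start` and `num` parameters and the page size of 20 are assumptions to verify against the documentation, and `page_params` is a hypothetical helper, not part of the library.

```python
def page_params(api_key, cites_id, max_results=500, page_size=20):
    """Return one parameter dict per results page.

    Assumes the engine accepts a `start` offset and a `num` page size.
    """
    return [
        {
            "api_key": api_key,
            "engine": "google_scholar",
            "hl": "en",
            "cites": cites_id,
            "start": start,    # offset of the first result on this page
            "num": page_size,  # results requested per page
        }
        for start in range(0, max_results, page_size)
    ]


pages = page_params("secret_api_key", "12939847369066114508", max_results=60)
print([p["start"] for p in pages])
```

Each dictionary could then be passed to `GoogleSearch(...)` exactly as in the single-page example, stopping early once a page comes back without results.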

Example JSON output:

{
  "position": 1,
  "title": "Lavaan: An R package for structural equation modeling and more. Version 0.5–12 (BETA)",
  "result_id": "HYlMgouq9VcJ",
  "type": "Pdf",
  "link": "https://users.ugent.be/~yrosseel/lavaan/lavaanIntroduction.pdf",
  "snippet": "Abstract In this document, we illustrate the use of lavaan by providing several examples. If you are new to lavaan, this is the first document to read … 3.1 Entering the model syntax as a string literal … 3.2 Reading the model syntax from an external file …",
  "publication_info": {
    "summary": "Y Rosseel - Journal of statistical software, 2012 - users.ugent.be",
    "authors": [
      {
        "name": "Y Rosseel",
        "link": "https://scholar.google.com/citations?user=0R_YqcMAAAAJ&hl=en&oi=sra",
        "serpapi_scholar_link": "https://serpapi.com/search.json?author_id=0R_YqcMAAAAJ&engine=google_scholar_author&hl=en",
        "author_id": "0R_YqcMAAAAJ"
      }
    ]
  },
  "resources": [
    {
      "title": "ugent.be",
      "file_format": "PDF",
      "link": "https://users.ugent.be/~yrosseel/lavaan/lavaanIntroduction.pdf"
    }
  ],
  "inline_links": {
    "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=HYlMgouq9VcJ",
    "cited_by": {
      "total": 10913,
      "link": "https://scholar.google.com/scholar?cites=6338159566757071133&as_sdt=2005&sciodt=0,5&hl=en",
      "cites_id": "6338159566757071133",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=2005&cites=6338159566757071133&engine=google_scholar&hl=en"
    },
    "related_pages_link": "https://scholar.google.com/scholar?q=related:HYlMgouq9VcJ:scholar.google.com/&scioq=&hl=en&as_sdt=2005&sciodt=0,5",
    "versions": {
      "total": 27,
      "link": "https://scholar.google.com/scholar?cluster=6338159566757071133&hl=en&as_sdt=2005&sciodt=0,5",
      "cluster_id": "6338159566757071133",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=2005&cluster=6338159566757071133&engine=google_scholar&hl=en"
    },
    "cached_page_link": "https://scholar.googleusercontent.com/scholar?q=cache:HYlMgouq9VcJ:scholar.google.com/&hl=en&as_sdt=2005&sciodt=0,5"
  }
},
...
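A minimal sketch of turning one such result into a flat record for tabulating journals and authors. It assumes every result has the shape shown above and that `publication_info.summary` always follows the "authors - venue, year - source" pattern seen in this single example, which may not hold for every result. `flatten_result` is a hypothetical helper, not part of the SerpApi library.

```python
import re

YEAR = r"\b(?:19|20)\d{2}\b"


def flatten_result(result):
    """Extract title, authors, venue, year and citation count from one result."""
    info = result.get("publication_info", {})
    # Assumed pattern: "authors - venue, year - source"
    parts = info.get("summary", "").split(" - ")
    authors = [name.strip() for name in parts[0].split(",") if name.strip()]
    venue_part = parts[1] if len(parts) > 1 else ""
    year_match = re.search(YEAR, venue_part)
    venue = re.sub(r",?\s*" + YEAR + r"\s*$", "", venue_part).strip()
    cited_by = result.get("inline_links", {}).get("cited_by", {})
    return {
        "title": result.get("title"),
        "authors": authors,
        "venue": venue or None,
        "year": int(year_match.group(0)) if year_match else None,
        "cited_by": cited_by.get("total"),
    }


# Trimmed copy of the example result shown above.
example = {
    "title": "Lavaan: An R package for structural equation modeling and more. Version 0.5–12 (BETA)",
    "publication_info": {
        "summary": "Y Rosseel - Journal of statistical software, 2012 - users.ugent.be"
    },
    "inline_links": {"cited_by": {"total": 10913}},
}

record = flatten_result(example)
print(record)
```

Records like this can be counted by venue or author with `collections.Counter`, or filtered by keyword in the title.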

Check out the documentation for more details.

Disclaimer: I work at SerpApi.

Milos Djurdjevic