I'm a graduate student in economics, currently working on a research project that involves Google Scholar. Although economists usually use Stata, Google Scholar is easier to access from R, so I've spent the past week learning R. Needless to say, I'm a beginner and there is a lot I don't understand yet.

I managed to scrape a list of economists from the web and to draw a random sample from that list. I would now like to get some Google Scholar information about these academics. To do so, I plan to use the 'scholar' package.

My problem is that 'scholar' asks for Google Scholar IDs. I only have the names of the economists, so I would like to retrieve their IDs.

I basically want to run a Google Scholar query for each economist: https://scholar.google.fr/scholar?hl=fr&as_sdt=0%2C5&q="NAME OF THE ECONOMIST" and find the Google Scholar ID in the HTML of the results page.

I tried with economist "Emmanuel Saez" to get started: https://scholar.google.fr/scholar?hl=fr&as_sdt=0%2C5&q=Emmanuel+Saez&btnG=

The relevant CSS selector is ".gs_rt2", so my code looks like:

library(rvest)  # provides read_html() and html_nodes()

page <- read_html("https://scholar.google.fr/scholar?hl=fr&as_sdt=0%2C5&q=Emmanuel+Saez&btnG=")
text <- html_nodes(page, ".gs_rt2")

The object "text" looks something like this:

[1] <h4 class="gs_rt2"><a href="/citations?user=qZpr_CQAAAAJ&amp;hl=fr&amp;oe=ASCII&amp;oi=ao"><b...

I'm just missing the last part: how do I tell R to select just the 12-character code after "user="?

It must be pretty obvious, but I just can't figure out how to do it. If someone can help me out that would be great.

Thanks, G. Gauthier

  • Maybe you can use a regular expression? Have a look at `?gsub` or `?regexec`. The substring function of the `stringr` package (`?str_sub`) may come in handy as well – Bruno Zamengo Oct 31 '17 at 11:53
  • Possible duplicate of [Extracting a string between other two strings in R](https://stackoverflow.com/questions/39086400/extracting-a-string-between-other-two-strings-in-r) – Otto Kässi Oct 31 '17 at 11:59

2 Answers

The simplest way is probably a regular expression. Something like:

user_code <- sub(".*user=([A-Za-z0-9_-]+)&.*", "\\1", as.character(text))

where "\\1" is a backreference to the text captured by the parentheses (the backslash is doubled because it has to be escaped inside an R string). Try ?regexp and ?sub to find out more.
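
A minimal sketch of the regular-expression approach, run on a hard-coded copy of the string from the question so that it needs no network access. The helper name `extract_scholar_id` is hypothetical, and the character class assumes that IDs consist only of letters, digits, underscores and hyphens.

```r
# Hypothetical helper: return the value of the "user" query parameter found
# in each string, or NA where a string contains no "user=" parameter.
extract_scholar_id <- function(x) {
  x <- as.character(x)
  # Assumption: IDs are made of letters, digits, "_" and "-".
  pattern <- "user=([A-Za-z0-9_-]+)"
  hits <- regmatches(x, regexec(pattern, x))
  # Each hit is c(full match, captured group), or character(0) if no match.
  vapply(hits,
         function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

# Sample input copied from the question (truncated there, so closed by hand).
html_snippet <- '<h4 class="gs_rt2"><a href="/citations?user=qZpr_CQAAAAJ&amp;hl=fr&amp;oe=ASCII&amp;oi=ao">'

user_code <- extract_scholar_id(html_snippet)
print(user_code)
```

Unlike `sub()`, which returns the input unchanged when the pattern does not match, this version returns `NA`, which may be easier to detect when looping over many names.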

I may be missing something, but to get the ID it may be simpler to just use strsplit:

gsid <- strsplit(as.character(text),"(user=)|&")[[1]][2]

This returns the Google Scholar ID from the text (same as above).
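
The `[[1]][2]` indexing only looks at the first matched node. As a rough sketch, the same split can be applied to several nodes at once. The strings below stand in for `as.character(text)`, and the second ID is made up for illustration.

```r
# Sample strings standing in for as.character(text).
# The first is taken from the question; the second ID is invented.
nodes <- c(
  '<h4 class="gs_rt2"><a href="/citations?user=qZpr_CQAAAAJ&amp;hl=fr&amp;oe=ASCII&amp;oi=ao">',
  '<h4 class="gs_rt2"><a href="/citations?user=AbC-123_xyzJ&amp;hl=fr&amp;oe=ASCII&amp;oi=ao">'
)

# Split every string on "user=" or "&"; the ID is the second piece,
# provided "user=" appears before the first "&" (an assumption about the markup).
pieces <- strsplit(nodes, "(user=)|&")
gsids <- vapply(pieces, function(p) p[2], character(1))
print(gsids)
```

If a string yields fewer than two pieces, `p[2]` is `NA`, so names without a profile link can be filtered out afterwards.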
