Retrieving Google Scholar IDs based on name information

Question

I'm a graduate student in economics and I'm currently working on a research project that involves Google Scholar. Though economists usually use Stata, the access to Google Scholar is made easier via R, so I've been learning how R works for the past week. Needless to say I'm a beginner and there are loads of things I don't really understand.

I managed to scrape a list of economists from the web and to draw a random sample from it. I would now like to get some Google Scholar information about these academics. To do so, I plan to use the 'scholar' package.

My problem is that 'scholar' asks for Google Scholar IDs. I only have the names of the economists, so I would like to retrieve their IDs.

I basically want to run a Google Scholar query for each economist: https://scholar.google.fr/scholar?hl=fr&as_sdt=0%2C5&q="NAME OF THE ECONOMIST" and find the Google Scholar ID in the HTML of the results page.

I tried with economist "Emmanuel Saez" to get started: https://scholar.google.fr/scholar?hl=fr&as_sdt=0%2C5&q=Emmanuel+Saez&btnG=

The relevant CSS selector is ".gs_rt2", so my code looks like:

library(rvest)  # provides read_html() and html_nodes()
page <- read_html("https://scholar.google.fr/scholar?hl=fr&as_sdt=0%2C5&q=Emmanuel+Saez&btnG=")
text <- html_nodes(page, ".gs_rt2")

The object "text" looks something like this:

[1] <h4 class="gs_rt2"><a href="/citations?user=qZpr_CQAAAAJ&amp;hl=fr&amp;oe=ASCII&amp;oi=ao"><b...

I'm just missing the last part: how do I tell R to select just the 12-character code after "user="?

It must be pretty obvious, but I just can't figure out how to do it. If someone can help me out that would be great.

Thanks, G. Gauthier

Maybe you can use a regexp? Have a look at `?gsub` or `?regexec`. The substring function of the `stringr` package (`?str_sub`) may come in handy as well. – Bruno Zamengo, Oct 31 '17 at 11:53
Possible duplicate of [Extracting a string between other two strings in R](https://stackoverflow.com/questions/39086400/extracting-a-string-between-other-two-strings-in-r) – Otto Kässi, Oct 31 '17 at 11:59

score 2 · Accepted Answer · answered Oct 31 '17 at 11:52

The simplest way is probably a regular expression. Something like:

user_code <- sub(".*user=([A-Za-z0-9_-]+)&.*", "\\1", text)

where "\\1" is a backreference to the part of the pattern enclosed in parentheses. Try ?regexp and ?sub to find out more.
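As a minimal sketch of the same idea, the snippet below runs on a hard-coded string, so no network access is needed. The sample markup is copied from the question. The helper name `extract_user_id` is hypothetical, and the character class `[^&"]+` (everything up to the next `&` or quote) is an assumption, chosen so the pattern also tolerates digits and hyphens in an ID.

```r
# Sample markup copied from the question (the node text is truncated there too).
node_text <- '<h4 class="gs_rt2"><a href="/citations?user=qZpr_CQAAAAJ&amp;hl=fr&amp;oe=ASCII&amp;oi=ao">'

# Hypothetical helper: return the value of the "user" query parameter,
# or NA when the string contains no "user=" part.
extract_user_id <- function(x) {
  x <- as.character(x)
  m <- regexec("user=([^&\"]+)", x)  # group 1: everything up to the next & or quote
  hits <- regmatches(x, m)           # one element per input: c(full match, group 1)
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_, character(1))
}

extract_user_id(node_text)
# [1] "qZpr_CQAAAAJ"
```

Unlike `sub`, which returns the input unchanged when the pattern does not match, this version returns `NA`, which is easier to detect when looping over many names.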

Thank you. It works fine. I'll dig into 'sub' and 'regexp'. – G. Gauthier Oct 31 '17 at 13:14
Sure, didn't know I could! Thanks again. – G. Gauthier Nov 01 '17 at 16:50

score 2 · Answer 2 · edited Oct 13 '19 at 03:53

I may be missing something, but to get the ID it may be simpler to just use strsplit:

gsid <- strsplit(as.character(text),"(user=)|&")[[1]][2]

This returns the Google Scholar ID from the text (same as above).
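`strsplit` is vectorised, so the same split can be applied to several nodes at once. The sketch below is illustrative only: the first string is taken from the question, while the second ID is made up and `texts` and `gsids` are hypothetical names.

```r
# Two sample link fragments; the second ID is invented for illustration.
texts <- c(
  '<a href="/citations?user=qZpr_CQAAAAJ&amp;hl=fr">',
  '<a href="/citations?user=AbC-123_xyzJ&amp;hl=fr">'
)

# Split each string on "user=" or "&"; the ID is the second piece.
parts <- strsplit(texts, "(user=)|&")
gsids <- vapply(parts, function(p) p[2], character(1))
gsids
# [1] "qZpr_CQAAAAJ" "AbC-123_xyzJ"
```

This relies on "user=" appearing before the first "&" in each string. If a string has no "user=" part, the second piece is whatever follows the first "&", so it is worth checking the results before passing them on.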

answered Oct 13 '19 at 03:35 by psh · edited Oct 13 '19 at 03:53 by Ben
