
Using R and the XML package, I have parsed a page with the XML htmlParse function, which returns an object of class ("HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" "XMLAbstractDocument"). The line in the XML object that I am interested in (see below) contains two values that I would like returned.

Besides the value from class=gsc_1usr_name (which returns "Konrad Wrzecionkowski"), I need to pull the value under "user=", which is, in this case, "QnVgFlYAAAAJ". I have tried several syntax variations with xpathSApply and it always returns NULL. Admittedly, I am pretty clueless when it comes to XML. Any ideas? Is there a way I can coerce this to a different object class, such as a list, and then use split on a vector? Standard coercion functions (e.g., as.list, as.character) do not seem to work on this object class.

search.page <- "http://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=GVN Powell World Wildlife Fund"
x <- XML::htmlParse(search.page, encoding="UTF-8")

This returns an XML object. Below is a subset showing a single entry out of 10. The h3 class="gsc_1usr_name" line contains the values that I would like to retrieve from each entry (for all 10).

</div>
</div>
<div class="gsc_1usr gs_scl">
<div class="gsc_1usr_photo"><a href="/citations?user=QnVgFlYAAAAJ&amp;hl=en&amp;oe=ASCII"><img src="/citations?view_op=view_photo&amp;user=QnVgFlYAAAAJ&amp;citpid=3" sizes="(max-width:599px) 75px,(max-width:1251px) 100px, 120px" srcset="/citations?view_op=view_photo&amp;user=QnVgFlYAAAAJ&amp;citpid=3 128w,/citations?view_op=medium_photo&amp;user=QnVgFlYAAAAJ&amp;citpid=3 256w" alt="Konrad Wrzecionkowski"></a></div>
<div class="gsc_1usr_text">
<h3 class="gsc_1usr_name"><a href="/citations?user=QnVgFlYAAAAJ&amp;hl=en&amp;oe=ASCII">Konrad Wrzecionkowski</a></h3>
<div class="gsc_1usr_aff">Zachodniopomorski Uniwersytet Technologiczny w Szczecinie, Błękitny Patrol <span class="gs_hlt">WWF </span>Polska</div>
<div class="gsc_1usr_eml">Verified email at <span class="gs_hlt">wwf</span>.pl</div>
<div class="gsc_1usr_emlb">@wwf.pl</div>
<div class="gsc_1usr_int">
<a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=en&amp;oe=ASCII&amp;mauthors=label:ichtiologia_ochrona_przyrody">ichtiologia / ochrona przyrody</a> </div>
</div>
</div>

Using the following syntax for the xpathSApply function, I get "GVN Powell" back, but I would also like the value from user=. I have tried variations of h3[@user=''], including subqueries on class, but cannot get anything else to work.

XML::xpathSApply(x, "//h3[@class='gsc_1usr_name']", xmlValue)

The approach that I have been using so far relies on url and readLines. I then use strsplit to pull the desired value.

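# Author search string appended to the Google Scholar search URL
auth.names <- "Konrad Wrzecionkowski WWF"
search.page <- paste("http://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=", auth.names, sep="")

# Read the raw page source, then keep the text between "user=" and "&amp;"
x <- readLines(url(search.page))
x <- strsplit(x[[1]], split="user=")[[1]][2]
x <- strsplit(x, split="&amp;")[[1]][1]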

The problem here is that Google Scholar does not seem to like web scraping, and the code periodically fails with a "Cannot open connection, HTTP status was '503 Service Unavailable'" error. However, this does not seem to be the case with htmlParse.
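For reference, the strsplit code above only inspects x[[1]] and keeps only the first ID it finds. A minimal sketch of the same idea applied to every line, using base R regular expressions, might look like the following. The vector x here is a stand-in for the output of readLines, and the second entry (ID and name) is hypothetical, added only to show that more than one match is handled.

```r
# Stand-in for readLines(url(search.page)): two lines of page source.
# The second line is a made-up entry used purely for illustration.
x <- c('<a href="/citations?user=QnVgFlYAAAAJ&amp;hl=en&amp;oe=ASCII">Konrad Wrzecionkowski</a>',
       '<a href="/citations?user=AbCdEfGhIjKl&amp;hl=en">Hypothetical Author</a>')

# Find every "user=<id>" occurrence on every line; an ID ends at "&" or a quote.
hits <- regmatches(x, gregexpr("user=[^&\"]+", x))

# Flatten the per-line matches, strip the "user=" prefix, and drop duplicates.
user.ids <- unique(sub("^user=", "", unlist(hits)))
user.ids
```

On the real page each ID appears several times per entry (photo link, name link), which is why unique is applied at the end.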

Jeffrey Evans

1 Answer

library(rvest)
library(magrittr)

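url <- "http://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=GVN Powell World Wildlife Fund"
# Path to the highlighted name inside the first result entry
xpath <- "//*[@id=\"gsc_ccl\"]/div[1]/div[2]/h3/a/span"

# Download the page, select the matching nodes, and extract their text
gvn.powell <- url %>%
  read_html %>%
  html_nodes(xpath = xpath) %>%
  html_text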

gvn.powell
#[1] "GVN Powell"
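
As for the user= value from the question: h3[@user=''] cannot match, because user is not an attribute of the h3 node. It is part of the query string in the href attribute of the a element nested inside the h3. A minimal sketch with the XML package, assuming the page structure shown in the question, is to read that href with xmlGetAttr and then cut the ID out with a regular expression. The inline HTML string is a copy of one entry from the question so that the example runs without hitting Google Scholar; with a live page, doc would be the result of htmlParse on the search URL.

```r
library(XML)

# Inline copy of one result entry from the question, so the sketch runs offline.
html <- '<div class="gsc_1usr gs_scl">
<div class="gsc_1usr_text">
<h3 class="gsc_1usr_name"><a href="/citations?user=QnVgFlYAAAAJ&amp;hl=en&amp;oe=ASCII">Konrad Wrzecionkowski</a></h3>
</div>
</div>'
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Author names: the text of the <a> inside each matching h3.
author.names <- xpathSApply(doc, "//h3[@class='gsc_1usr_name']/a", xmlValue)

# Profile links: the href attribute of the same <a> nodes.
hrefs <- xpathSApply(doc, "//h3[@class='gsc_1usr_name']/a", xmlGetAttr, "href")

# Keep only the value between "user=" and the next "&" (or the end of the string).
user.ids <- sub(".*[?&]user=([^&]+).*", "\\1", hrefs)

data.frame(name = author.names, user = user.ids, stringsAsFactors = FALSE)
```

Because both XPath queries select the same a nodes in document order, the names and IDs line up row by row, which should carry over to all 10 entries on a real results page.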