
Hello, I'm new to using R to scrape data from the Internet and, sadly, know little about HTML and XML. I am trying to scrape each story link from the following parent page: http://www.who.int/csr/don/archive/year/2013/en/index.html. I don't care about any of the other links on the parent page, but I need to create a table with a row for each story URL and columns for the corresponding URL, the title of the story, the date (it always appears at the beginning of the first sentence following the story title), and the rest of the text of the page (which can be several paragraphs).

I've tried to adapt the code at Scraping a wiki page for the "Periodic table" and all the links (and several related threads) but have run into difficulties. Any advice or pointers would be greatly appreciated. Here's what I've tried so far (with "?????" where I run into trouble):

rm(list=ls())
library(XML)
library(plyr) 

url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)

links = getNodeSet(doc, ?????)

df = ldply(links, function(x) {
  text = xmlValue(x)
  if (text=='') text=NULL

  symbol = xmlGetAttr(x, '?????')
  link = xmlGetAttr(x, 'href')
  if (!is.null(text) & !is.null(symbol) & !is.null(link))
    data.frame(symbol, text, link)
} )

df = head(df, ?????)
– user2535366

1 Answer


You can use xpathSApply (an lapply equivalent) to search your document with an XPath expression:

library(XML)
url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)
dat <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href'),
  story = xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue))
dat

 ##               dates                                                hrefs
## 1      26 June 2013             /entity/csr/don/2013_06_26/en/index.html
## 2      23 June 2013             /entity/csr/don/2013_06_23/en/index.html
## 3      22 June 2013             /entity/csr/don/2013_06_22/en/index.html
## 4      17 June 2013             /entity/csr/don/2013_06_17/en/index.html

##                                                                                    story
## 1                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 2                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 3                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 4                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update

EDIT: add the text of each story

dat$text <- unlist(lapply(dat$hrefs, function(x) {
  ## rewrite the relative href to an absolute URL, then grab the story body
  url.story <- gsub('/entity', 'http://www.who.int', x)
  xpathSApply(htmlParse(url.story), '//*[@id="primary"]', xmlValue)
}))
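Once `dat$text` is populated, searching the stories for terms of interest (as mentioned in the comments below) is a one-liner with `grepl`. A minimal sketch using a toy data frame standing in for the scraped one; the column values and the search term are assumptions for illustration:

```r
## Toy stand-in for the scraped data frame (hypothetical values)
dat <- data.frame(
  hrefs = c("/entity/csr/don/2013_06_26/en/index.html",
            "/entity/csr/don/2013_05_01/en/index.html"),
  text  = c("Middle East respiratory syndrome coronavirus (MERS-CoV) - update ...",
            "Avian influenza A(H7N9) - update ..."),
  stringsAsFactors = FALSE)

## Flag stories whose full text mentions the term, then pull their URLs
dat$mentions_mers <- grepl("MERS-CoV", dat$text, fixed = TRUE)
dat[dat$mentions_mers, "hrefs"]
```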
– agstudy
  • That's great, thank you so much! Any idea on how then to go to each url named in the 2nd column and save the text of that page as a 4th column of the data frame? – user2535366 Jun 30 '13 at 00:37
  • @user2535366 All the text ?:) Does this make sense? – agstudy Jun 30 '13 at 00:43
  • Well, probably not, but that never stopped me before :) Ultimately, I want to search each of these stories for certain terms, and this is what seemed like a convenient way to collect it all together. – user2535366 Jun 30 '13 at 00:47
  • @user2535366 Enjoy my edit. My advice: read up on XPath, it is the way to go (nothing is magic, just some learning curve). You can also see [this](http://stackoverflow.com/questions/17309142/webscrape-text-using-logical-grep-in-r/17309502#17309502) on how to search for some words in your final text. – agstudy Jun 30 '13 at 01:09
  • I've employed the hints above successfully for several years of stories I wanted to scrape. One last question (hopefully), however. If you replace 2013 with the years 2011, 2008, and 2006, in the URL, the script fails and produces errors like, e.g. for 2011, "Error in data.frame(dates = xpathSApply(doc, "//*[@class=\"auto_archive\"]/li/a", : arguments imply differing number of rows: 59, 60". Looking at the raw HTML file at http://www.who.int/csr/don/archive/year/2011/en/index.html, I see no obvious difference in structure relative to the 2012 and 2013 URLs where the script works. Ideas? – user2535366 Jun 30 '13 at 17:12
  • Maybe it is preferable you ask a new question. You can reference this one. – agstudy Jun 30 '13 at 17:23
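To follow up on the "differing number of rows" error in the last comment: it typically means one `<li>` entry in the archive list is missing its `link_info` node, so the three separate `xpathSApply` calls return vectors of unequal length. Walking each `<li>` node and building one row at a time keeps the columns aligned and yields `NA` for the missing piece. A minimal sketch on a made-up HTML fragment (the fragment is hypothetical; only the class names mirror the WHO pages):

```r
library(XML)

## Toy HTML mimicking the archive structure; the second <li> lacks a story title
html <- '<ul class="auto_archive">
  <li><a href="/entity/csr/don/2013_06_26/en/index.html">26 June 2013</a>
      <span class="link_info">MERS-CoV - update</span></li>
  <li><a href="/entity/csr/don/2013_06_23/en/index.html">23 June 2013</a></li>
</ul>'
doc <- htmlParse(html, asText = TRUE)

## One row per <li>: a missing link_info becomes NA instead of shifting columns
rows <- xpathApply(doc, '//*[@class="auto_archive"]/li', function(li) {
  a     <- getNodeSet(li, './a')[[1]]
  story <- xpathSApply(li, './/*[@class="link_info"]', xmlValue)
  data.frame(dates = xmlValue(a),
             hrefs = xmlGetAttr(a, 'href'),
             story = if (length(story)) story[1] else NA,
             stringsAsFactors = FALSE)
})
dat <- do.call(rbind, rows)
```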