
I am trying to scrape data from several sister URLs for analysis. A previous thread, "Scraping a web page, links on a page, and forming a table with R", was helpful in getting me on the right path with the following script:

rm(list=ls())
library(XML)
library(RCurl) 

#=======2013========================================================================
url2013 = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url2013)
# One row per archive entry: date and link from the archive list, title from link_info
dummy2013 <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href'),
  title = xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue)
)

dummy2013$text = unlist(lapply(dummy2013$hrefs, function(x) {
  # Turn the relative '/entity' href into an absolute URL, then pull the story text
  url.story <- gsub('/entity', 'http://www.who.int', x)
  xpathSApply(htmlParse(url.story), '//*[@id="primary"]', xmlValue)
}))

dummy2013$link <- gsub('/entity','http://www.who.int',dummy2013$hrefs)

write.csv(dummy2013, "whoDON2013.csv")

However, the script breaks when applied to sister URLs. Trying

#=======2011========================================================================
url2011 = 'http://www.who.int/csr/don/archive/year/2011/en/index.html'
doc <- htmlParse(url2011)
dummy2011 <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr,'href'),
  title = xpathSApply(doc, '//*[@class="link_info"]/text()',  xmlValue)
)

for example, produces

## Error in data.frame(dates = xpathSApply(doc, "//*[@class=\"auto_archive\"]/li/a",  : 
##   arguments imply differing number of rows: 59, 60

Similar errors occur for http://www.who.int/csr/don/archive/year/2008/en/index.html and http://www.who.int/csr/don/archive/year/2006/en/index.html. I'm not handy with HTML or XML, so any ideas are appreciated.
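A quick way to see where the mismatch comes from is to count the node sets separately before binding them into a data frame. Below is a minimal, self-contained reproduction; the markup is a mock I wrote for illustration, and the real WHO pages may differ in detail:

```r
library(XML)

# Mock archive page: the second date carries two "link_info" titles,
# so dates and titles no longer line up one-to-one
html <- '
<ul class="auto_archive">
  <li><a href="/entity/don/a/en/">5 January 2011</a>
      <span class="link_info">Story A</span></li>
  <li><a href="/entity/don/b/en/">12 January 2011</a>
      <span class="link_info">Story B1</span>
      <span class="link_info">Story B2</span></li>
</ul>'
doc <- htmlParse(html, asText = TRUE)

# Count each node set before handing them to data.frame()
n <- c(
  dates  = length(xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue)),
  titles = length(xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue))
)
n  # dates = 2, titles = 3, so data.frame() cannot bind them into equal rows
```

Running the same two counts against the parsed 2011 archive page should show the 59/60 split reported in the error message.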

  • I guess the error occurs because, for some dates, you have more than one story. The script assumes a one-to-one story -> date relation. – agstudy Jun 30 '13 at 17:57
  • You could probably leverage that the stories are all grouped under `…` to catch these kinds of duplicate `"link_info"`'s. – Thomas Jun 30 '13 at 18:01
  • Thanks, though it's not clear to me that the duplicate dates are the issue -- the same script works for 2012 (http://www.who.int/csr/don/archive/year/2012/en/index.html), for example, and successfully scrapes three stories that occurred on 23 November. – user2535366 Jun 30 '13 at 18:15
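Following the comments about duplicate stories per date, one way around the mismatch is to walk the archive list items one at a time and repeat the date for every title found inside the same node, so the columns can never drift out of step. This is only a sketch against mock markup (assumed for illustration; the real WHO structure may differ), not a tested fix for the live pages:

```r
library(XML)

# Mock archive page where the second date has two stories
# (markup assumed for illustration)
html <- '
<ul class="auto_archive">
  <li><a href="/entity/don/a/en/">5 January 2011</a>
      <span class="link_info">Story A</span></li>
  <li><a href="/entity/don/b/en/">12 January 2011</a>
      <span class="link_info">Story B1</span>
      <span class="link_info">Story B2</span></li>
</ul>'
doc <- htmlParse(html, asText = TRUE)

# Iterate per list item, so every title stays paired with the date in its own node
rows <- lapply(getNodeSet(doc, '//*[@class="auto_archive"]/li'), function(li) {
  titles <- xpathSApply(li, './/*[@class="link_info"]', xmlValue)
  data.frame(
    dates = rep(xpathSApply(li, './a', xmlValue)[1], length(titles)),
    hrefs = rep(xpathSApply(li, './a', xmlGetAttr, 'href')[1], length(titles)),
    title = titles,
    stringsAsFactors = FALSE
  )
})
dummy <- do.call(rbind, rows)
dummy  # three rows; "12 January 2011" appears twice, once per story
```

The same loop applied to the 2011, 2008, and 2006 archive pages should produce one row per story rather than erroring, provided those pages really do keep each date's titles inside the same list item.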