
I am trying to scrape data from several sister URLs for analysis. A previous thread, "Scraping a web page, links on a page, and forming a table with R", was helpful in getting me on the right path with the following script:

rm(list=ls())
library(XML)
library(RCurl) 

#=======2013========================================================================
url2013 = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url2013)
# One row per archive entry: date and link from the archive list, title from link_info
dummy2013 <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href'),
  title = xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue)
)

dummy2013$text = unlist(lapply(dummy2013$hrefs, function(x) {
  # Turn the relative '/entity' href into an absolute URL, then pull the story text
  url.story <- gsub('/entity', 'http://www.who.int', x)
  xpathSApply(htmlParse(url.story), '//*[@id="primary"]', xmlValue)
}))

dummy2013$link <- gsub('/entity','http://www.who.int',dummy2013$hrefs)

write.csv(dummy2013, "whoDON2013.csv")

However, the script breaks when applied to sister URLs. Trying

#=======2011========================================================================
url2011 = 'http://www.who.int/csr/don/archive/year/2011/en/index.html'
doc <- htmlParse(url2011)
dummy2011 <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr,'href'),
  title = xpathSApply(doc, '//*[@class="link_info"]/text()',  xmlValue)
)

for example, produces

## Error in data.frame(dates = xpathSApply(doc, "//*[@class=\"auto_archive\"]/li/a",  : 
##   arguments imply differing number of rows: 59, 60

Similar errors occur for http://www.who.int/csr/don/archive/year/2008/en/index.html and http://www.who.int/csr/don/archive/year/2006/en/index.html. I'm not handy with HTML or XML, so any ideas are appreciated.
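A quick way to see where the mismatch comes from is to count the node sets separately before binding them into a data frame. Below is a minimal, self-contained reproduction; the markup is a mock I wrote for illustration, and the real WHO pages may differ in detail:

```r
library(XML)

# Mock archive page: the second date carries two "link_info" titles,
# so dates and titles no longer line up one-to-one
html <- '
<ul class="auto_archive">
  <li><a href="/entity/don/a/en/">5 January 2011</a>
      <span class="link_info">Story A</span></li>
  <li><a href="/entity/don/b/en/">12 January 2011</a>
      <span class="link_info">Story B1</span>
      <span class="link_info">Story B2</span></li>
</ul>'
doc <- htmlParse(html, asText = TRUE)

# Count each node set before handing them to data.frame()
n <- c(
  dates  = length(xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue)),
  titles = length(xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue))
)
n  # dates = 2, titles = 3, so data.frame() cannot bind them into equal rows
```

Running the same two counts against the parsed 2011 archive page should show the 59/60 split reported in the error message.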

  • I guess the error occurs because, for some dates, you have more than one story. The script assumes a one-to-one story -> date relation. – agstudy Jun 30 '13 at 17:57
  • You could probably leverage that the stories are all grouped under `…` to catch these kinds of duplicate `"link_info"`'s. – Thomas Jun 30 '13 at 18:01
  • Thanks, though it's not clear to me that the duplicate dates are the issue -- the same script works for 2012 (http://www.who.int/csr/don/archive/year/2012/en/index.html), for example, and successfully scrapes three stories that occurred on 23 November. – user2535366 Jun 30 '13 at 18:15
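Following the comments about duplicate stories per date, one way around the mismatch is to walk the archive list items one at a time and repeat the date for every title found inside the same node, so the columns can never drift out of step. This is only a sketch against mock markup (assumed for illustration; the real WHO structure may differ), not a tested fix for the live pages:

```r
library(XML)

# Mock archive page where the second date has two stories
# (markup assumed for illustration)
html <- '
<ul class="auto_archive">
  <li><a href="/entity/don/a/en/">5 January 2011</a>
      <span class="link_info">Story A</span></li>
  <li><a href="/entity/don/b/en/">12 January 2011</a>
      <span class="link_info">Story B1</span>
      <span class="link_info">Story B2</span></li>
</ul>'
doc <- htmlParse(html, asText = TRUE)

# Iterate per list item, so every title stays paired with the date in its own node
rows <- lapply(getNodeSet(doc, '//*[@class="auto_archive"]/li'), function(li) {
  titles <- xpathSApply(li, './/*[@class="link_info"]', xmlValue)
  data.frame(
    dates = rep(xpathSApply(li, './a', xmlValue)[1], length(titles)),
    hrefs = rep(xpathSApply(li, './a', xmlGetAttr, 'href')[1], length(titles)),
    title = titles,
    stringsAsFactors = FALSE
  )
})
dummy <- do.call(rbind, rows)
dummy  # three rows; "12 January 2011" appears twice, once per story
```

The same loop applied to the 2011, 2008, and 2006 archive pages should produce one row per story rather than erroring, provided those pages really do keep each date's titles inside the same list item.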