I'm trying to scrape a Wikipedia page, something I have done many times before:

library(XML)
myURL <- "http://en.wikipedia.org/wiki/List_of_US_Open_MenUs_Singles_champions"
y <- readHTMLTable(myURL,  stringsAsFactors = FALSE)

R crashes, both in RStudio and in the standard GUI.

Other SO answers to similar problems suggested using readLines:

u <- url(myURL)
readLines(u)  # cannot open: HTTP status was '404 Not Found'

The URL is actually redirected, so I entered the final URL:

myURL <- "http://en.wikipedia.org/wiki/List_of_US_Open_Men%27s_Singles_champions"

This time readLines does output the page, but the XML functions, including htmlParse, still cause a crash.

TIA

pssguy
  • There is indeed a bug in the `XML` package, possibly in `RS_XML_ParseTree`, as indicated by @benbolker in a comment to my answer. – Andrie Sep 11 '12 at 15:12

1 Answer

I have found the package httr invaluable in solving any web scraping problem. In this case, you need to add a user agent profile, since Wikipedia blocks the content if you don't:

library(httr)
library(XML)
myURL <- "http://en.wikipedia.org/wiki/List_of_US_Open_Men%27s_Singles_champions"
page <- GET(myURL, user_agent("httr"))
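# text_content() was deprecated in later httr releases; content(as = "text") is the equivalent
x <- readHTMLTable(content(page, as = "text"), as.data.frame=TRUE)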
head(x[[1]])

Produces this:

  US Open Men's Singles Champions                                                          NA
1                Official website                                                        <NA>
2                        Location                        Queens – New York City United States
3                           Venue                USTA Billie Jean King National Tennis Center
4                  Governing body                                                        USTA
5                         Created 1881 (established)Open Era: 1968\n(44 editions, until 2011)
6                         Surface  Grass (1881–1974)HarTru (1975–1977)DecoTurf (1978–Present)
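
A slightly more defensive variant of the same idea, as a minimal sketch rather than a definitive fix: download the page with httr, stop on an HTTP error such as the 404 seen in the question, and only then hand the text to XML. The helper names `parse_tables` and `fetch_tables` are hypothetical, and the `encoding = "UTF-8"` argument is an assumption about the page. Whether this avoids the crash in `RS_XML_ParseTree` in every case is not something I can confirm.

```r
library(httr)
library(XML)

# Hypothetical helper: parse HTML that is already in memory as a string.
# asText = TRUE tells htmlParse to treat the input as markup, not as a
# file name or URL.
parse_tables <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Hypothetical helper: download with httr, then pass the text to XML.
fetch_tables <- function(url, agent = "httr") {
  page <- GET(url, user_agent(agent))
  # Raise an ordinary R error on 4xx/5xx responses instead of parsing an error page
  stop_for_status(page)
  # encoding = "UTF-8" is an assumption about the page being fetched
  parse_tables(content(page, as = "text", encoding = "UTF-8"))
}
```

With these defined, `x <- fetch_tables(myURL)` followed by `head(x[[1]])` should behave like the code above, with the HTTP check added.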
Andrie
  • yes, although I can confirm a crash via segmentation fault with the OP's code, which indicates a True Bug somewhere (`RS_XML_ParseTree` is the proximal cause ...). Probably worth an e-mail to the maintainers. – Ben Bolker Sep 11 '12 at 15:10
  • @BenBolker Sorry, yes. I can also confirm that the original code crashed on me, hence using `httr` which at least returned some results. – Andrie Sep 11 '12 at 15:11
  • @Andrie Thanks, I will check out httr. It is odd that a very similar URL does not cause a crash: http://en.wikipedia.org/wiki/List_of_Wimbledon_gentlemen%27s_singles_champions – pssguy Sep 11 '12 at 15:34