I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. I am wondering how I can do it.
3 Answers
Not really sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, it's not a good idea to parse HTML with regular expressions, so you will definitely want to parse it with the XML package.
Here's an example to get you started:
require(RCurl)
require(XML)

# download the raw HTML as a single string
webpage <- getURL("http://www.haaretz.com/")

# split the string into a character vector, one element per line
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

# parse the HTML into an internal DOM tree, ignoring malformed-markup errors
pagetree <- htmlTreeParse(webpage, error = function(...) {}, useInternalNodes = TRUE)

# extract the text of every table node via XPath
x <- xpathSApply(pagetree, "//*/table", xmlValue)

# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))   # split multi-line strings into separate elements
x <- gsub("\t", "", x)           # drop tab characters
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl = TRUE)  # trim whitespace
x <- x[!(x %in% c("", "|"))]     # drop empty and separator-only elements
This results in a character vector of mostly just webpage text (along with some JavaScript):
> head(x)
[1] "Subscribe to Print Edition"               "Fri., December 04, 2009 Kislev 17, 5770"  "Israel Time: 16:48 (EST+7)"
[4] "Make Haaretz your homepage"               "/*check the search form*/"                "function chkSearch()"
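The same xpathSApply call works with any other XPath expression. For instance, a minimal sketch (the "//a" expression and "href" attribute here are just illustrative, not part of the original answer) that pulls every link URL from the already-parsed tree:

# extract the href attribute of every anchor node from the same parsed tree;
# unlist flattens the result, since anchors without an href return NULL
links <- unlist(xpathSApply(pagetree, "//a", xmlGetAttr, "href"))
head(links)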
OOOhhhhh wow ... I am scraping a dynamic website and I tried everything in the past 7-8 hours and was not able to do it. This one worked for me. Lifesaver – Ali May 23 '18 at 05:05
Your best bet may be the XML package -- see for example this previous question.
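A minimal sketch of that approach, using htmlParse and xmlValue from the XML package (the URL is the one from the question; the "//p" expression is just illustrative):

library(XML)

# htmlParse fetches and parses the page in one step, tolerating messy markup
doc <- htmlParse("http://www.haaretz.com/")

# xmlValue returns a node's text content with the tags already stripped --
# no hand-written regex needed
paragraphs <- xpathSApply(doc, "//p", xmlValue)
head(paragraphs)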

– Dirk Eddelbuettel
But how can I get rid of the HTML tags properly? I know I can write a regex, but is there any package that makes the coding less dramatic? – Mark Dec 04 '09 at 05:56
I know you asked for R, but maybe Python + BeautifulSoup is the way forward here? Then do your analysis in R once you have scraped the screen with BeautifulSoup. A sketch of the R side of that hand-off follows.
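A minimal sketch of the R side, assuming the Python/BeautifulSoup step has written its results to a CSV file ("haaretz.csv" is a hypothetical filename, not anything from the question):

# read back the table that the (hypothetical) BeautifulSoup scraper wrote to disk
scraped <- read.csv("haaretz.csv", stringsAsFactors = FALSE)

# from here on it is ordinary R analysis
str(scraped)
summary(scraped)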

– Andreas