I suggest using the XML package and XPath. This requires some learning, but if you're serious about web scraping, it's the way to go. I did this with some county-level elections data from the NY Times website ages ago, and the code looked something like this (just to give you an idea):
library(XML)

getCounty <- function(url) {
  # parse the page, keeping the internal representation so XPath queries work
  doc <- htmlTreeParse(url, useInternalNodes = TRUE)
  # text of every table cell whose class is 'county-name'
  nodes <- getNodeSet(doc, "//tr/td[@class='county-name']/text()")
  tmp <- sapply(nodes, xmlValue)
  county <- sapply(tmp, function(x) clean(x, num = FALSE))
  return(county)
}
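clean() is just a small helper of my own for tidying the raw strings, and I haven't shown it here. A minimal stand-in, assuming all you want is to trim whitespace (and keep only digits when num = TRUE), would be something like:

clean <- function(x, num = FALSE) {
  # drop leading/trailing whitespace
  x <- gsub("^[[:space:]]+|[[:space:]]+$", "", x)
  # for numeric fields, keep only digits and the decimal point
  if (num) x <- gsub("[^0-9.]", "", x)
  x
}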
You can learn about XPath here.
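To give a flavor of the syntax, here are a few common shapes, assuming doc is a page parsed as above (the 'content' id is made up for illustration, and whether any of these match depends on the page):

getNodeSet(doc, "//a")                      # every <a> anywhere in the document
getNodeSet(doc, "//a[@href]")               # only <a> nodes that carry an href attribute
getNodeSet(doc, "//div[@id='content']//p")  # every <p> anywhere inside the div with id 'content'
getNodeSet(doc, "//table/tr[1]/td")         # the cells of the first row of each table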
Another example: grab all R package names from the Crantastic timeline. This looks for a div node with the id "timeline", then for the ul with the class "timeline" inside it, takes the first a node from each of its li children, and returns their text:
url <- 'http://crantastic.org/'
doc <- htmlTreeParse(url, useInternalNodes = TRUE)
# text of the first link inside each timeline entry
nodes <- getNodeSet(doc, "//div[@id='timeline']/ul[@class='timeline']/li/a[1]/text()")
tmp <- sapply(nodes, xmlValue)
tmp
 [1] "landis"          "vegan"           "mutossGUI"       "lordif"
 [5] "futile.paradigm" "lme4"            "tm"              "qpcR"
 [9] "igraph"          "aspace"          "ade4"            "MCMCglmm"
[13] "hts"             "emdbook"         "DCGL"            "wq"
[17] "crantastic"      "Psychometrics"   "crantastic"      "gR"
[21] "crantastic"      "Distributions"   "rAverage"        "spikeslab"
[25] "sem"
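As a shorthand, the XML package also has xpathSApply(), which rolls getNodeSet() and sapply() into a single call, so the extraction above could also be written as:

tmp <- xpathSApply(doc, "//div[@id='timeline']/ul[@class='timeline']/li/a[1]/text()", xmlValue)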