
I am pretty new to web scraping, and I am trying to build a scraper in R that accesses information in a website's source code/HTML.

Specifically, I want to be able to determine whether a (number of) website(s) has an id containing a certain text: "google_ads_iframe". The full id will always be longer than this, so I think I will have to use a wildcard.

I have tried several options (see below), but so far nothing has worked.

1st method:

doc <- htmlTreeParse("http://www.funda.nl/") 

data <- xpathSApply(doc, "//div[contains(@id, 'google_ads_iframe')]", xmlValue, trim = TRUE) 

Error message reads:

Error in UseMethod("xpathApply") : 
  no applicable method for 'xpathApply' applied to an object of class "XMLDocumentContent"

2nd method:

scrapestuff <- scrape(url = "http://www.funda.nl/", parse = T, headers = T)

x <- xpathSApply(scrapestuff[[1]],"//div[contains(@class, 'google_ads_iframe')]",xmlValue)

x returns as an empty list.

3rd method:

scrapestuff <- read_html("http://www.funda.nl/")
hh <- htmlParse(scrapestuff, asText=T)
x <- xpathSApply(hh,"//div[contains(@id, 'google_ads_iframe')]",xmlValue)

Again, x is returned as an empty list.

I can't figure out what I am doing wrong, so any help would be really great!

  • What packages are the functions from? – nya Aug 23 '16 at 08:55
  • The easy way to count the number of occurrences in text is by using `sum(stringr::str_count(scrapestuff, "google_ads_iframe"))` if scrapestuff could be character. Check [here](http://stackoverflow.com/a/15356666/5967807). – nya Aug 23 '16 at 09:11
  • Actually, for HTML/XML the easy way to count the number of occurrences in HTML/XML is by using—for example—`xml_find_first(pg, "count(.//div[contains(@class, 'featured')])")` – hrbrmstr Aug 23 '16 at 11:22

1 Answer


My ad blocker is probably preventing me from seeing Google ads iframes, but you don't have to waste cycles with additional R functions to test for the presence of something. Let the optimized C functions in libxml2 (which underpins the rvest and xml2 packages) do the work for you and just wrap your XPath in boolean():

library(xml2)

pg <- read_html("http://www.funda.nl/")

xml_find_lgl(pg, "boolean(.//div[contains(@class, 'featured')])")
## [1] TRUE

xml_find_lgl(pg, "boolean(.//div[contains(@class, 'futured')])")
## [1] FALSE

One other issue you'll need to deal with is that the google ads iframes are most likely being generated after page load with javascript, which means you'll need RSelenium to grab the rendered page source (you can then use this method with the resultant source).

UPDATE

I found a page example with google_ads_iframe in it:

pg <- read_html("http://codepen.io/anon/pen/Jtizx.html")

xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] TRUE

xml_find_first(pg, "count(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] 3

That's a rendered page, though, and I suspect you'll still need to use RSelenium to do the page grabbing. Here's how to do that (if you're on a reasonable operating system and have phantomjs installed, otherwise use it with Firefox):

library(RSelenium)
RSelenium::startServer()
phantom_js <- phantom(pjs_cmd='/usr/local/bin/phantomjs', extras=c("--ssl-protocol=any"))
capabilities <- list(phantomjs.page.settings.userAgent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.3")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities=capabilities)
remDr$open()

URL <- "http://www.funda.nl/"  # the page you want rendered
remDr$navigate(URL)
raw_html <- remDr$getPageSource()[[1]]

pg <- read_html(raw_html)
...

# eventually (when done)
phantom_js$stop()

NOTE

The nested XPath I used with the codepen example (since it has a google ads iframe) was necessary because of how the markup is structured. Here's the snippet where the iframe exists:

<div id="div-gpt-ad-1379506098645-3" style="width:720px;margin-left:auto;margin-right:auto;display:none;">
  <script type="text/javascript">
  googletag.cmd.push(function() { googletag.display('div-gpt-ad-1379506098645-3'); });
  </script>
  <iframe id="google_ads_iframe_/16833175/SmallPS_0" name="google_ads_iframe_/16833175/SmallPS_0" width="723" height="170" scrolling="no" marginwidth="0" marginheight="0" frameborder="0" src="javascript:&quot;<html><body style='background:transparent'></body></html>&quot;" style="border: 0px; vertical-align: bottom;"></iframe></div>

The iframe tag is a child of the div, so if you want to target the div first, you then have to add the child target if you want to find an attribute in it.
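To make the parent/child distinction concrete, here is a minimal sketch using an inline HTML string shaped like the snippet above (the `id` values are illustrative, not taken from a live page), so the XPath behaviour can be checked without fetching anything:

```r
library(xml2)

# A div whose id does NOT contain the target text, wrapping an iframe whose id does
snippet <- '<div id="div-gpt-ad-example"><iframe id="google_ads_iframe_/16833175/SmallPS_0"></iframe></div>'
pg <- read_html(snippet)

# Matching the div directly on @id fails: the id lives on the iframe, not the div
xml_find_lgl(pg, "boolean(.//div[contains(@id, 'google_ads_iframe')])")
## [1] FALSE

# Descend into the child iframe and it matches
xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] TRUE

# Or skip the div entirely and target the iframe itself
xml_find_lgl(pg, "boolean(.//iframe[contains(@id, 'google_ads_iframe')])")
## [1] TRUE
```

The last form is the simplest if you only care whether the iframe exists anywhere on the page; the div-wrapped form matters when you want to anchor the test to a particular container.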

hrbrmstr
  • Thanks, unfortunately it doesn't quite work. Your example runs fine. When I use it as a model for the following, it returns false, even though I know it should be true: xml_find_lgl(pg, "boolean(.//div[contains(@id, 'google_ads_iframe')])") – Sean de Hoon Aug 23 '16 at 11:38
  • because the div does not contain that id, it contains an iframe with that id, which is kinda why I used the example I did. – hrbrmstr Aug 23 '16 at 11:40
  • In a second step I want to see whether a certain source is used for the ad and I am using an src. I tried the following, but it didn't work: xml_find_lgl(pg, "boolean(.//div[iframe[contains(@src, 'googleadservices.com')]])") – Sean de Hoon Aug 23 '16 at 13:05
  • aren't those generally in the `img` or `script` tags? I know `iframe`s can have `src` attributes as well, but for google ads they are usually `javascript` and not using that domain. – hrbrmstr Aug 23 '16 at 13:24
  • I think the set up differs, but generally you are probably right. I tried script instead of iframe, but that didn't yield any results either. I also tried leaving out div. – Sean de Hoon Aug 23 '16 at 13:30
  • I pasted the html here (for a different website, because it seemed like the former did not have the text i was looking for) http://pastebin.com/Jx2gQpsr – Sean de Hoon Aug 23 '16 at 15:53
  • `xml_find_lgl(pg, "boolean(.//script[contains(@src, 'googleadservices.com')])")` seems to work – hrbrmstr Aug 23 '16 at 16:05
  • Awesome! Thanks so much for your help! – Sean de Hoon Aug 23 '16 at 16:33