My ad blocker is probably preventing me from seeing Google Ads iframes, but you don't have to waste cycles with additional R functions to test for the presence of something. Let the optimized C functions in libxml2 (which underpins rvest and the xml2 package) do the work for you and just wrap your XPath with boolean():
library(xml2)
pg <- read_html("http://www.funda.nl/")
xml_find_lgl(pg, "boolean(.//div[contains(@class, 'featured')])")
## [1] TRUE
xml_find_lgl(pg, "boolean(.//div[contains(@class, 'futured')])")
## [1] FALSE
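If you want to try the same test without hitting a live site, here is a minimal sketch that runs it against a literal HTML string. The markup and the has_div_class() helper are made up for illustration and are not part of xml2; only read_html() and xml_find_lgl() come from the package.

```r
library(xml2)

# Hypothetical markup, used only for illustration
html <- '<html><body>
  <div class="item featured">A</div>
  <div class="item">B</div>
</body></html>'
pg <- read_html(html)

# Hypothetical helper: TRUE if any <div> has `cls` as a substring of its
# class attribute. libxml2 evaluates boolean() and returns a single logical.
has_div_class <- function(doc, cls) {
  xml_find_lgl(doc, sprintf("boolean(.//div[contains(@class, '%s')])", cls))
}

has_div_class(pg, "featured")
## [1] TRUE
has_div_class(pg, "futured")
## [1] FALSE
```

Note that contains() is a plain substring match, so 'featured' would also match a class such as 'unfeatured'; that is usually acceptable for a quick presence test.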
One other issue you'll need to deal with is that the Google Ads iframes are most likely generated after page load with JavaScript, which means using RSelenium to grab the rendered page source (you can then use this method on the resulting page source).
UPDATE
I found an example page with google_ads_iframe in it:
pg <- read_html("http://codepen.io/anon/pen/Jtizx.html")
xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] TRUE
xml_find_num(pg, "count(.//div[iframe[contains(@id, 'google_ads_iframe')]])")
## [1] 3
That's a rendered page, though, and I suspect you'll still need to use RSelenium to do the page grabbing. Here's how to do that (if you're on a reasonable operating system and have phantomjs installed, otherwise use it with Firefox):
library(RSelenium)
RSelenium::startServer()
phantom_js <- phantom(pjs_cmd='/usr/local/bin/phantomjs', extras=c("--ssl-protocol=any"))
capabilities <- list(phantomjs.page.settings.userAgent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.3")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities=capabilities)
remDr$open()
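remDr$navigate(URL)  # URL is the address of the page you want to check
raw_html <- remDr$getPageSource()[[1]]
pg <- read_html(raw_html)  # parse the rendered source, then run the XPath tests on pg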
...
# eventually (when done)
phantom_js$stop()
NOTE
The XPath I used with the codepen example (since it has a google ads iframe) was necessary. Here's the snippet where the iframe exists:
<div id="div-gpt-ad-1379506098645-3" style="width:720px;margin-left:auto;margin-right:auto;display:none;">
<script type="text/javascript">
googletag.cmd.push(function() { googletag.display('div-gpt-ad-1379506098645-3'); });
</script>
<iframe id="google_ads_iframe_/16833175/SmallPS_0" name="google_ads_iframe_/16833175/SmallPS_0" width="723" height="170" scrolling="no" marginwidth="0" marginheight="0" frameborder="0" src="javascript:"<html><body style='background:transparent'></body></html>"" style="border: 0px; vertical-align: bottom;"></iframe></div>
The iframe tag is a child of the div, so if you want to target the div first, you then have to add the child iframe to the predicate in order to test an attribute on that iframe.
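To make the difference concrete, here is a small sketch using a trimmed-down, hypothetical version of the snippet above (the second div and its id are invented for contrast). It compares testing @id on the div itself with testing it on a child iframe, and uses xml_find_num() for the count.

```r
library(xml2)

# Trimmed-down, hypothetical version of the ad markup shown above
html <- '<html><body>
  <div id="div-gpt-ad-1379506098645-3">
    <iframe id="google_ads_iframe_/16833175/SmallPS_0"></iframe>
  </div>
  <div id="content"><p>No ads here</p></div>
</body></html>'
pg <- read_html(html)

# @id tested on the div itself: no match, since the div id is "div-gpt-ad-..."
on_div <- xml_find_lgl(pg, "boolean(.//div[contains(@id, 'google_ads_iframe')])")

# @id tested on an iframe that is a child of the div: match
on_child <- xml_find_lgl(pg, "boolean(.//div[iframe[contains(@id, 'google_ads_iframe')]])")

# Number of divs that directly contain such an iframe
n_divs <- xml_find_num(pg, "count(.//div[iframe[contains(@id, 'google_ads_iframe')]])")

on_div
## [1] FALSE
on_child
## [1] TRUE
n_divs
## [1] 1
```

If you only care whether the iframe exists anywhere, regardless of its parent, a simpler predicate such as .//iframe[contains(@id, 'google_ads_iframe')] should be enough.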