
My ultimate goal is to web scrape the Standings page of The Puzzled Pint for Montreal.

I think I need to scrape dynamically (e.g. use RSelenium) since the table I'm interested in sits inside an iframe, i.e. a part of a web page that displays content independently of its container.

Some have suggested that scraping directly from the source of these iframes is the way to go. I used the web developer Inspector tool in my Firefox browser to find the src attribute, which happens to point to Google Sheets.


First, check robots.txt to make sure we're allowed to scrape it from Google Sheets:

library(robotstxt)
paths_allowed("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=203220308")

Now that I know I have permission, I tried the RCurl package. It's simple to get the first page:

library(RCurl)
# fetch the first (default) sheet of the published spreadsheet as CSV
sheet <- getForm(
  "https://docs.google.com/spreadsheet/pub",
  hl = "en_US",
  key = "1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U",
  output = "csv",
  .opts = list(followlocation = TRUE, verbose = TRUE, ssl.verifypeer = FALSE)
)
df <- read.csv(textConnection(sheet))
head(df)

However, when you click any of the other Month/Year links on this Google Sheet, the gid= in the URL changes. For example, for October 2018 it's now:

https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=1367583807

I'm not sure whether it's possible to scrape widgets with RCurl? If it is, I'd love to hear how.
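
One idea I haven't tested (so this is purely an assumption about the publish endpoint, not something I know Google supports for this sheet) is to pass the gid, plus single = "true", to the same getForm() call and see whether it returns that tab as CSV:

# untested: ask the publish endpoint for the October 2018 tab by gid
sheet_oct <- getForm(
  "https://docs.google.com/spreadsheet/pub",
  hl = "en_US",
  key = "1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U",
  single = "true",      # assumption: per-tab export needs single=true
  gid = "1367583807",   # October 2018 (from the URL above)
  output = "csv",
  .opts = list(followlocation = TRUE, verbose = TRUE, ssl.verifypeer = FALSE)
)
df_oct <- read.csv(textConnection(sheet_oct))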

So it looks like I will most likely need to use RSelenium to do this.

library(RSelenium)
# connect to a running server
remDr <- remoteDriver(
  remoteServerAddr = "192.168.99.100",
  port = 4445L
)
remDr$open()
# navigate to the site of interest
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=203220308")

My problem is getting the HTML for the table on this page. The following was suggested on SO, but it doesn't work for me (it doesn't return the expected output, just the Month/Year metadata from the links/elements):

library(XML)
doc <- htmlParse(remDr$getPageSource()[[1]])
readHTMLTable(doc)
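
I suspect this is because the standings table lives inside an iframe: getPageSource() only returns the document of the current frame, so the outer page contains the iframe tags but not their contents. One way to check is to list the iframes and their src attributes (a sketch; I haven't confirmed what this returns for this particular page):

# the outer page's source contains the <iframe> tags but not their contents,
# so list the iframes and look at where they point
iframes <- remDr$findElements(using = "tag name", "iframe")
sapply(iframes, function(el) el$getElementAttribute("src")[[1]])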

I believe I need to navigate to the inner frame, but I'm not sure how to do this. For example, when I look for the CSS tag for this table with SelectorGadget in Chrome, it warns me that the table is inside an iframe and that I need to click a link to be able to select within it.


When I navigate to this link and use readHTMLTable(), I get the information I want:

remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pubhtml/sheet?headers=false&gid=203220308")
doc <- htmlParse(remDr$getPageSource()[[1]])
readHTMLTable(doc)
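
Since only the gid changes in this working URL, the same pattern could be reused for any month whose gid I know (a sketch using the two gids mentioned so far; collecting all of the gids is still the open problem):

# build direct per-sheet URLs from the gids I already know
gids <- c("203220308", "1367583807")
sheet_urls <- sprintf(
  "https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pubhtml/sheet?headers=false&gid=%s",
  gids
)

tables <- lapply(sheet_urls, function(u) {
  remDr$navigate(u)
  readHTMLTable(htmlParse(remDr$getPageSource()[[1]]))
})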


This presents a problem as I need to use RSelenium to navigate through the different pages/tables of the previous link (the iframe widget):

remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=203220308")

To navigate through the different pages/tables, I use SelectorGadget to find the CSS tags:

# find all elements/links
webElems <- remDr$findElements(using = "css", ".switcherItem")
# Select the first link (October 2018)
webElem_01 <- webElems[[1]]
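
As a sanity check, the text of each switcher link can be printed before clicking anything (getElementText() returns a list, hence the [[1]]):

# print the label of each Month/Year link, e.g. "October 2018", "September 2018", ...
sapply(webElems, function(el) el$getElementText()[[1]])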

Then, using the TightVNC viewer, I verified that I was highlighting the correct element and then "clicked" it (in this case the October 2018 link).

webElem_01$highlightElement()


webElem_01$clickElement()

Since I can see on TightVNC that the page changed, I assume no further steps are required before capturing/scraping here, but as mentioned I still need a way of programmatically navigating to the inner iframe of each of these pages.

UPDATE

Okay, I figured out how to navigate to the inner frame using the remDr$switchToFrame() command, but I can't figure out how to navigate back to the outer frame in order to "click" the next link and repeat the process. My current hacky attempt involves navigating back to the main page and repeating the process many times:

# navigate to the main page
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")
# find the iframe that holds the table
tableElem <- remDr$findElement(using = "id", "pageswitcher-content")
# switch into the iframe
remDr$switchToFrame(tableElem)
# parse html
doc <- htmlParse(remDr$getPageSource()[[1]])
readHTMLTable(doc)

# how do I switch back to the outer frame?
# the remDr$goBack() command doesn't seem to do this

# workaround is to navigate back to the main page then navigate back to the second page and repeat process
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")
webElems <- remDr$findElements(using = "css", ".switcherItem")
webElem_01 <- webElems[[1]]
webElem_01$clickElement()
# find the iframe that holds the table again
tableElem <- remDr$findElement(using = "id", "pageswitcher-content")
# switch into the iframe
remDr$switchToFrame(tableElem)
# parse html
doc2 <- htmlParse(remDr$getPageSource()[[1]])
readHTMLTable(doc2)
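
Written out as a loop, that workaround looks roughly like this. It's only a sketch: the Sys.sleep() pauses are guesses at load time, and it assumes the .switcherItem links and the pageswitcher-content id are stable across months. (RSelenium's switchToFrame() may accept NULL to return to the top-level document, which would avoid the re-navigation, but I haven't verified that.)

main_url <- "https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156"

# count the Month/Year links once
remDr$navigate(main_url)
n_links <- length(remDr$findElements(using = "css", ".switcherItem"))

tables <- vector("list", n_links)
for (i in seq_len(n_links)) {
  # re-navigate so the switcher links are fresh each iteration (hacky, but
  # avoids having to switch back out of the iframe)
  remDr$navigate(main_url)
  Sys.sleep(2)  # guessed wait for the widget to load
  webElems <- remDr$findElements(using = "css", ".switcherItem")
  webElems[[i]]$clickElement()
  Sys.sleep(2)
  # switch into the iframe that holds the table and scrape it
  tableElem <- remDr$findElement(using = "id", "pageswitcher-content")
  remDr$switchToFrame(tableElem)
  doc <- htmlParse(remDr$getPageSource()[[1]])
  tables[[i]] <- readHTMLTable(doc)
}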
  • You _could_ just request access to the Google Sheets. Most site owners are pretty friendly, and if you explain your use-case you might not have to do any of this scraping at all. The non-HTML views for all those URLs are locked down (I checked :-) but all the content is CCA-NCSA-3.0, so they aren't restricting the use of the data. feedback@puzzledpint.com will likely be much quicker than the web spelunking (though you've done a good job triaging). – hrbrmstr Nov 24 '18 at 20:53
  • Thanks for the suggestion to request access. I agree this would be the simplest way but this is also mostly for myself to learn RSelenium :) – Matthew J. Oldach Nov 24 '18 at 22:18

2 Answers


Having noted what I did in the comment, this works for ~19 of the URLs:

library(googlesheets4)
library(rvest)

pg <- read_html("http://www.puzzledpint.com/standings/")

html_nodes(pg, xpath=".//iframe[contains(@src, 'sheet')]") %>% 
  html_attr("src") -> gsheet_urls

gsheet_urls[grepl("output=html", gsheet_urls, fixed=TRUE)] %>% 
  lapply(function(x) {
    read.csv(
      file = gsub("=true", "=false", gsub("=html", "=csv", x)),
      stringsAsFactors = FALSE
    )
  }) -> nineteen

str(nineteen, 1)
## List of 19
##  $ :'data.frame': 8 obs. of  6 variables:
##  $ :'data.frame': 37 obs. of  7 variables:
##  $ :'data.frame': 35 obs. of  6 variables:
##  $ :'data.frame': 62 obs. of  6 variables:
##  $ :'data.frame': 34 obs. of  6 variables:
##  $ :'data.frame': 30 obs. of  11 variables:
##  $ :'data.frame': 24 obs. of  6 variables:
##  $ :'data.frame': 11 obs. of  6 variables:
##  $ :'data.frame': 9 obs. of  6 variables:
##  $ :'data.frame': 13 obs. of  6 variables:
##  $ :'data.frame': 36 obs. of  6 variables:
##  $ :'data.frame': 9 obs. of  6 variables:
##  $ :'data.frame': 13 obs. of  6 variables:
##  $ :'data.frame': 29 obs. of  6 variables:
##  $ :'data.frame': 45 obs. of  6 variables:
##  $ :'data.frame': 34 obs. of  6 variables:
##  $ :'data.frame': 22 obs. of  6 variables:
##  $ :'data.frame': 3 obs. of  6 variables:
##  $ :'data.frame': 14 obs. of  6 variables:
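
To make the URL rewrite concrete: applied to the Montreal iframe src from the question, the two gsub() calls produce the following (whether the CSV actually downloads still depends on the sheet's export settings):

x <- "https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=203220308"
gsub("=true", "=false", gsub("=html", "=csv", x))
## [1] "https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=csv&widget=false#gid=203220308"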

All of the pubhtml ones have CSV exporting expressly denied, so you'd need permission for those.

hrbrmstr
  • Thanks, Bob! Since you're the author of `splashr` (an alternative to `RSelenium` that I haven't tried yet), I'm wondering why you didn't suggest a solution with that tool? – Matthew J. Oldach Nov 25 '18 at 17:30
  • The `rvest` + `googlesheets4` method only works for ~30% of the tables (19/63), and only for the current month. Requesting access to the Google Sheets from PuzzledPint (IMO the *least fun* way) or my updated solution with `RSelenium` appear to be the best solutions at the moment – Matthew J. Oldach Nov 25 '18 at 17:33
  • Aye, I noted the "~19" in my answer. The reason for no `splashr` answer is that I'm not writing all the Lua code necessary for doing this :-) There are nuances to groups of tables, each requiring custom Selenium processing and/or custom `splashr` Splash Lua code. That's asking quite a bit of an SO contributor IMO. – hrbrmstr Nov 25 '18 at 17:40
  • That's fair enough. I was wondering if it was the length/complexity of code that it would require or if it was for some other reason (e.g. akin to using a shovel to kill a fly). I appreciate you taking the time to respond. – Matthew J. Oldach Nov 26 '18 at 20:17

You don't need Selenium; just extract the gid or pageUrl values from the page source of

https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true

Example code (excerpted from the page source):

items.push(
{
  name: "November 2018",
  pageUrl: "https:\/\/docs.google.comm\/.....&gid=690408156",
  gid: "690408156",
  initialSheet: ("690408156" == gid)
});
items.push(
{
  name: "October 2018",
  pageUrl: "https:\/\/docs.google.com\/.....&gid=1367583807",
  gid: "1367583807",
  initialSheet: ("1367583807" == gid)
});

Resulting URLs:

https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pubhtml/sheet?headers=false&gid=690408156
https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pubhtml/sheet?headers=false&gid=1367583807
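
In R, the gid values can be pulled out of the widget's page source with a regular expression (a sketch; it assumes the items.push(... gid: "...") entries shown above appear in the served HTML):

widget_url <- "https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true"
src <- paste(readLines(widget_url, warn = FALSE), collapse = "\n")

# pull every gid: "<digits>" occurrence out of the page source
gids <- regmatches(src, gregexpr('gid:\\s*"\\d+"', src))[[1]]
gids <- unique(gsub("\\D", "", gids))

# build the direct per-sheet URLs
sheet_urls <- sprintf(
  "https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pubhtml/sheet?headers=false&gid=%s",
  gids
)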
ewwink
  • This doesn't look like R code to me; did you use the terminal? How did you go about finding these `gid` numbers from the page source `programmatically`? Say, for example, a new `gid` is generated next month: would a `regex` search grab it and append it to your `items.push()` code? – Matthew J. Oldach Nov 25 '18 at 17:04