I'm new to programming and am trying to scrape data from the site below. When I run the code below, it returns an empty dataset or table. Any help or alternatives would be greatly appreciated.

url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003" 
tab <- url %>% read_html %>%  
  html_node("dogruns_wrapper") %>%  
  html_text()    
View(tab)

I have tried XPath with the same result, and using html_table() instead of html_text() returns this error: no applicable method for 'html_table' applied to an object of class "xml_missing".

Phil
J. Doe
  • I think it can't be done using rvest because the table is generated via JavaScript. You should try with RSelenium/splashr or some other JavaScript rendering service. – Mislav Sep 11 '18 at 09:25
  • Thank you Mislav. I will look into those. – J. Doe Sep 12 '18 at 10:22

1 Answer

As Mislav stated, the table is generated with JavaScript, so your best option is RSelenium.
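To see why rvest alone comes back empty, here is a minimal sketch that uses two made-up HTML strings instead of the real site: one standing in for the page as the server sends it, and one for the page after the browser has run the JavaScript. The markup and column names are invented for illustration, and I am assuming that dogruns_wrapper is an element id, in which case the CSS selector needs a leading "#".

```r
library(rvest)

# Hypothetical stand-in for the raw HTML that read_html() downloads:
# the table wrapper has not been created yet.
static_page <- read_html(
  '<html><body><div id="content"><p>Loading...</p></div></body></html>'
)

# No element matches, so rvest returns an "xml_missing" object.
# Passing that object to html_table() raises the error from the question.
node <- html_node(static_page, "#dogruns_wrapper")
inherits(node, "xml_missing")

# Hypothetical stand-in for the page after JavaScript has built the table.
rendered_page <- read_html(
  '<html><body><div id="dogruns_wrapper"><table>
     <tr><th>Place</th><th>Track</th></tr>
     <tr><td>1</td><td>A</td></tr>
   </table></div></body></html>'
)

# Now the selector matches and the table can be extracted.
found <- html_node(rendered_page, "#dogruns_wrapper")
tab <- html_table(html_node(found, "table"))
```

Checking inherits(node, "xml_missing") before calling html_table() is a quick way to tell "my selector is wrong or the content is not in the downloaded HTML" apart from other failures.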

In addition, once you have the rendered page, html_table() gets you the table with less code than selecting a node and calling html_text().

My try:

# Load packages
library(rvest) #Loading the rvest package
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the webpage

# start a local Selenium server (this is the only way of starting RSelenium that works for me at the moment)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
# shell() exists only on Windows; on macOS/Linux use system(selCommand, wait = FALSE) instead
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()

# define url
url <- "https://fasttrack.grv.org.au/Dog/Form?id=2003010003"

# go to website
remDr$navigate(url)

# as it's being loaded with JavaScript and it has a slow load, add a sleep here
Sys.sleep(10) # increase as needed

# get the html object of the webpage
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()

# html_table() returns a list with every table found in html_obj; keep the first one
tab <- html_obj %>% html_table() %>% .[[1]]
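A fixed Sys.sleep(10) either wastes time or is too short on a slow day. One alternative is to poll until the content shows up. Below is a minimal sketch of that idea; wait_for() is a hypothetical helper I made up for illustration, not part of RSelenium, and the commented-out usage assumes the remDr session from the code above.

```r
# Hypothetical helper: call `ready()` every `interval` seconds until it
# returns TRUE or `timeout` seconds have passed.
# Returns TRUE on success and FALSE on timeout.
wait_for <- function(ready, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(ready())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Small offline demonstration: the condition becomes TRUE on the third check.
counter <- 0
ok <- wait_for(function() {
  counter <<- counter + 1
  counter >= 3
}, timeout = 5, interval = 0.1)

# Intended use with the Selenium session (not run here, needs a live browser):
# table_loaded <- function() {
#   length(remDr$findElements(using = "css selector", value = "table")) > 0
# }
# if (wait_for(table_loaded, timeout = 30)) {
#   tab <- remDr$getPageSource()[[1]] %>% read_html() %>% html_table() %>% .[[1]]
# }
```

Note that a table element can exist before its rows are filled in, so on a real page you may need a stricter condition, such as checking for at least one data row.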

Hope it helps! However, always check whether a website allows scraping before you do it. For this site, see the Terms and Conditions:

Except for the direct purpose of viewing, printing, accessing or interacting with the Web Site for your own personal use or as otherwise indicated on the Web Site or these Terms and Conditions, you must not copy, reproduce, modify, communicate to the public, adapt, transfer, distribute, download or store any of the contents of the Web Site (including Race Information as described below), or incorporate any part of the Web Site into another web site without GRV’s written consent.

Unai Sanchez
  • Very nice Unai!!! Is this table generated dynamically with Javascript? I tested a few ideas that came to mind, but I couldn't get this working. I couldn't even get my code to recognize one single table on that URL, but obviously there is one. All I can think of is that Javascript pulls the data dynamically from the server and creates the table when the page loads, but I don't know for sure. Some more information about this would be great! Thanks! – ASH Sep 23 '18 at 12:00
  • Wow thank you Unai!! I was lost trying to figure this out after Mislav sent me in the right direction but you made it so simple thank you. Absolutely have read the t's and c's and this is only for personal use but thanks for looking out. – J. Doe Sep 23 '18 at 23:23
  • @ryguy72 I think so. It takes a lot to load, so that is my guess. I haven't checked the source code myself, so I can't be sure. What specific question do you have about this problem/solution? – Unai Sanchez Sep 24 '18 at 13:06
  • @J.Doe You're welcome! If there is anything that you don't understand about the solution, just tell me! If it worked for you, don't forget to mark my solution as the answer :) – Unai Sanchez Sep 24 '18 at 13:08
  • @UnaiSanchez I get this error when running the second line of code: Error in shell(selCommand, wait = FALSE, minimized = TRUE) : could not find function "shell". I seem to be having trouble creating the remote driver. Is this because I am using a Mac? I can run a remote ecosystem through splashr and Docker, but I haven't yet figured out how to reproduce with splashr what you have done in RSelenium. – J. Doe Sep 28 '18 at 02:26
  • @J.Doe check these links for alternative ways to starting `RSelenium`: [tutorial](https://ropensci.org/tutorials/rselenium_tutorial/) or [StackOverflow question](https://stackoverflow.com/questions/42468831/how-to-set-up-rselenium-for-r) – Unai Sanchez Oct 02 '18 at 08:04