This is the website I'm trying to scrape from: https://www.premierleague.com/match/38413

I'm trying to get the Match Stats table, but when I scrape it I only get the first row, which contains just the team names.

This is the code I'm using:

library(rvest)
url <- "https://www.premierleague.com/match/38413"

my_html <- read_html(url)


tbls_ls <- my_html %>%
  html_nodes("table") %>%
  .[2] %>%
  html_table(fill = TRUE)

I'm no R expert, so I'm not sure what I'm doing wrong. Any help would be appreciated!

  • The problem is that the "Match Stats" table is generated by `JavaScript`, so `rvest` by itself cannot scrape it. You need other tools, e.g. `RSelenium` (very slow and not very stable), `PhantomJS` or `V8`. – niko Nov 09 '18 at 18:06
  • Most likely the data isn't actually in the page. Instead, the page is loaded, and JavaScript code then fetches the data and injects it into the page within the browser. The R page functions (as in rvest) only do the HTTP GET for the page; they don't run any scripts in the page. – Andrew Lavers Nov 09 '18 at 18:07
  • Mmk, would I be able to acquire the information by going through the page source? By inspecting the page I've been able to find all the information I need, but I don't know how to collect it all. – Kamaran McClanahan Nov 09 '18 at 18:10
  • Possible duplicate: https://stackoverflow.com/questions/41496552/extracting-html-table-from-a-website-in-r (it's the same website at least) – MrFlick Nov 09 '18 at 18:28
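
To illustrate what the comments above describe, here is a minimal sketch that needs no network access. The HTML string is a hypothetical stand-in for what a server might send before any script runs (it is not the actual Premier League markup), and it assumes rvest 1.0 or later for `html_elements()`.

```r
library(rvest)

# Hypothetical static markup: a table with a header row only, plus a script
# that a browser would run to fill in the body. Not the real page source.
static_html <- '
<html><body>
  <table id="stats">
    <thead><tr><th>Home</th><th>Stat</th><th>Away</th></tr></thead>
    <tbody></tbody>
  </table>
  <script>
    // In a browser, code here would fetch the stats and append rows.
  </script>
</body></html>'

page <- read_html(static_html)

# rvest parses the markup as delivered; the script is never executed,
# so there are no data rows to find.
rows <- html_elements(page, "table#stats tbody tr")
length(rows)

# The header cells are present in the static markup, so they are found.
header <- html_text(html_elements(page, "table#stats thead th"))
header
```

If the real page behaves the same way, that would explain getting only the row with the team names.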

1 Answer

I was able to extract the match stats with the following script:

library(RSelenium)
library(XML)

# Start a Selenium server in Docker (host port 4445 -> container port 4444).
# system() is used here because shell() exists only on Windows.
system('docker run -d -p 4445:4444 selenium/standalone-firefox')
Sys.sleep(5)  # give the container a few seconds to start; adjust as needed

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate('https://www.premierleague.com/match/38413')
remDr$screenshot(display = TRUE, useViewer = TRUE)

# Dismiss the cookie banner by clicking its accept button
obj_Accept_Cookie <- remDr$findElement("xpath", "/html/body/div[3]/div/div/div[1]/div[5]/button[1]")
obj_Accept_Cookie$clickElement()

# Scroll down the page
remDr$executeScript("scroll(0, 5000)")
remDr$executeScript("scroll(0, 15000)")

# Click the tab that shows the match stats
obj_Table_Stats <- remDr$findElement("xpath", "//*[@id='mainContent']/div/section[2]/div[2]/div[2]/div[1]/div/div/ul/li[3]")
obj_Table_Stats$clickElement()
remDr$screenshot(display = TRUE, useViewer = TRUE)

# Parse the rendered page source; the stats table is the third table found
page_Content <- remDr$getPageSource()[[1]]
stats_table <- readHTMLTable(page_Content)[[3]]  # avoids masking base::table()
stats_table
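
Picking the table by position (`[[3]]`) may break if the page layout changes. As a possible alternative, the sketch below selects the table by its content instead. It uses rvest (1.0 or later) rather than XML, which is my substitution and not part of the script above. The `page_Content` string here is a made-up stand-in for `remDr$getPageSource()[[1]]`, and every value in it is an invented placeholder, not real match data.

```r
library(rvest)

# Hypothetical stand-in for the rendered page source; all values are invented.
page_Content <- '
<html><body>
  <table><tr><th>Club</th><th>Pts</th></tr><tr><td>A</td><td>1</td></tr></table>
  <table>
    <thead><tr><th>Home</th><th>Stat</th><th>Away</th></tr></thead>
    <tbody>
      <tr><td>55</td><td>Possession %</td><td>45</td></tr>
      <tr><td>12</td><td>Shots</td><td>9</td></tr>
    </tbody>
  </table>
</body></html>'

tables <- html_elements(read_html(page_Content), "table")

# Flag the tables whose text mentions a stat label we expect to see.
# "Possession" is an assumed label; use whatever appears on the real page.
is_stats <- grepl("Possession", html_text(tables), fixed = TRUE)

# Take the first matching table and convert it to a data frame (tibble).
stats <- html_table(tables[is_stats][[1]])
stats
```

If no table matches, `tables[is_stats][[1]]` throws an error, so it may be worth checking `any(is_stats)` first.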
Emmanuel Hamel