I am new to web scraping with R and rvest. rvest can scrape static HTML, but I have found that it struggles to scrape data from heavily JavaScript-based sites.

I found some articles and blog posts, but they seem outdated, like https://awesomeopensource.com/project/yusuzech/r-web-scraping-cheat-sheet

In my case I want to scrape odds from sports betting sites, but in my opinion this isn't possible with rvest and SelectorGadget because of the JS.

There is an article from 2018 about scraping odds from PaddyPower (https://www.r-bloggers.com/how-to-scrape-data-from-a-javascript-website-with-r/), but it is outdated too, because PhantomJS is no longer maintained. RSelenium seems to be an option, but the repo has many open issues: https://github.com/ropensci/RSelenium.

So is it possible to work with RSelenium in its current state, or what alternatives to RSelenium do I have?

kind regards

pontilicious
  • Docker installation of RSelenium worked for me following this article: https://towardsdatascience.com/web-scraping-google-sheets-with-rselenium-9001eda399b0. Be aware that on some sites you might be confronted with an [anti-bot Captcha](https://stackoverflow.com/a/62455705/13513328) – Waldi Sep 13 '20 at 10:35
  • Will try. Hope it's not too complicated... – pontilicious Sep 13 '20 at 10:44
  • Hi pontilicious. Just be careful that you are not violating the terms and conditions of the sports betting sites by scraping their data. I think most such sites would explicitly forbid that in their T&Cs – Allan Cameron Sep 13 '20 at 11:18

1 Answer

I've had no problems using RSelenium with the help of the wdman package, which allowed me to just not bother with Docker. wdman also fetches all binaries you need if they aren't already available. It's nice magic.
Here's a simple script to spin up a Selenium instance with Chrome, open a site, get the contents as xml and then close it all down again.

library(wdman)
library(RSelenium)
library(xml2)

# start a selenium server with wdman, running the latest chrome version
selServ <- wdman::selenium(
  port = 4444L,
  version = 'latest',
  chromever = 'latest'
)

# start your chrome Driver on the selenium server
remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4444L,
  browserName = 'chrome'
)

# open a selenium browser tab
remDr$open()

# navigate to your site (placeholder URL, replace it with the page you want)
some_url <- 'https://www.example.com'
remDr$navigate(some_url)

# get the html contents of that site as xml tree
page_xml <- xml2::read_html(remDr$getPageSource()[[1]])

# do your magic
# ... check doc at `?remoteDriver` to see what your remDr object can help you do.

# clean up after you
remDr$close()
selServ$stop()
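
For the "do your magic" step, here is a minimal sketch of pulling values out of the parsed page with rvest selectors. So that it runs without a browser, a hard-coded HTML string stands in for `remDr$getPageSource()[[1]]`. The markup and the class names (`.outcome`, `.label`, `.price`) are invented for illustration; a real site will need its own selectors.

```r
library(xml2)
library(rvest)

# Stand-in for remDr$getPageSource()[[1]].
# The markup and class names below are hypothetical.
page_source <- '
<html><body>
  <div class="outcome"><span class="label">Home</span><span class="price">2.10</span></div>
  <div class="outcome"><span class="label">Draw</span><span class="price">3.40</span></div>
  <div class="outcome"><span class="label">Away</span><span class="price">3.25</span></div>
</body></html>'

# parse the rendered HTML exactly as in the script above
page_xml <- xml2::read_html(page_source)

# select the nodes with CSS selectors and extract their text
labels <- rvest::html_text(rvest::html_nodes(page_xml, ".outcome .label"), trim = TRUE)
prices <- rvest::html_text(rvest::html_nodes(page_xml, ".outcome .price"), trim = TRUE)

# collect everything in a data frame, converting prices to numbers
odds <- data.frame(
  outcome = labels,
  price = as.numeric(prices),
  stringsAsFactors = FALSE
)

print(odds)
```

In the live script you would drop `page_source` and run the selectors directly on the `page_xml` built from `remDr$getPageSource()`.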
alex_jwb90
  • Thanks, works fine. But how do I tell the browser to load the site with JS accepted or enabled? – pontilicious Sep 13 '20 at 19:55
  • Perhaps I'm still not understanding you fully. The Selenium-hosted Chrome is a fully fledged browser instance; it should even have opened a window on your machine where you can watch it in action. This browser supports JS just as your "normal" Chrome does. Are you experiencing any different behavior or problems with JS execution? Theoretically, page_xml should give you the page content after it has been constructed by the site scripts – alex_jwb90 Sep 13 '20 at 19:58
  • This is the problem: the site isn't loaded in the Chrome driver. In my normal Chrome browser it loads correctly. If I deactivate JS in my browser, I get the same problem there and the site doesn't load. So it seems like the remote driver tries to load the site without JS, I think. – pontilicious Sep 13 '20 at 20:15
  • Hey Alex, sure. I want to scrape odds from bookie sites. So for example the Home, Draw, Away odds from https://www.bet365.com/?nr=1#/AC/B1/U^92591234 – pontilicious Sep 14 '20 at 05:26
  • Well, then it's not a problem with JS execution in Selenium; it is merely your automated browser being blocked by the bookie sites. If you google it, you'll find discussions of that matter, [also some on SO](https://stackoverflow.com/a/62796313/1335174). It is against their TOS and thus I think you'll find it hard to get any explicit help here. As you'll find some suggestions that lean in that general direction: take a look [here](https://github.com/ropensci/RSelenium/issues/207) to learn how to pass additional config commands into the Chrome webdriver. Alex out :) – alex_jwb90 Sep 14 '20 at 11:55