3

I want to scrape a table like the one at http://www.oddsportal.com//hockey/usa/nhl/carolina-hurricanes-ottawa-senators-80YZhBGC/, specifically the bookmakers and the odds. The problem is that I don't know what kind of table it is or how to scrape it.

These threads might help (Scraping javascript with R, or What type of HTML table is this and what type of webscraping techniques can you use?), but I'd appreciate it if someone could point me in the right direction or, better yet, give instructions here.

So what kind of table is that odds table? Is it possible to scrape it with R and, if so, how?

Edit: I should have been clearer. I have scraped data with R for some time now and probably don't need help with the basics. After further inspection, that table is indeed generated by JavaScript, and that is the problem I need help with.

lunatus
  • Take a look at [Scraping html tables into R data frames using the XML package](http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package) – rrs Jun 20 '14 at 15:20
  • I have read that thread pretty extensively, but it hasn't helped me with scraping this specific table. I have scraped several other tables easily with those instructions. If, for example, I read the URL with tables <- readHTMLTable(theurl), the main odds table is not there. I also can't find the figures when I inspect the oddsportal source closely, which isn't the case for the Brazil Wikipedia table used in the link you provided. I'm afraid I might need more help – lunatus Jun 20 '14 at 16:54
  • Use a JavaScript/web dev debugger to see what requests the page is making. It might just be a JSON data request, in which case no scraping is needed and your R code can get the JSON data directly. Maybe. It's just slow and horrible for me. – Spacedman Jun 20 '14 at 18:20

2 Answers

5

You can use Selenium and RSelenium to get the relevant data:

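library(RSelenium)
library(XML)  # provides readHTMLTable()

appURL <- "http://www.oddsportal.com//hockey/usa/nhl/carolina-hurricanes-ottawa-senators-80YZhBGC"
# Start a local Selenium server (2014-era API; later RSelenium releases replaced startServer() with rsDriver())
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
# Load the page in the browser so that its JavaScript builds the odds table
remDr$navigate(appURL)
# Run JavaScript in the rendered page and return the HTML of the first table node
tblSource <- remDr$executeScript("return tbls[0].outerHTML;")[[1]]
# Parse the returned HTML string into a list of data frames
readHTMLTable(tblSource)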
> readHTMLTable(tblSource)
$`NULL`
     Bookmakers    1    X    2 Payout
1   bet-at-home 2.25 3.80 2.60  91.6%
2        bet365 2.29 3.79 2.64  92.7%
3       Betsson 2.35 3.75 2.65  93.5%
4          bwin 2.30 3.75 2.70  93.3%
5   MarathonBet 2.35 3.80 2.78  95.4%
6      Titanbet 2.30 3.95 2.50  91.9%
7       TonyBet 2.35 3.70 2.70  93.8%
8        Unibet 2.35 3.85 2.60  93.5%
9  William Hill 2.30 3.90 2.50  91.6%
10       Winner 2.30 3.95 2.50  91.9%
11       youwin 2.40 3.75 2.55  93.0%
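
A possible follow-up step, sketched here as an assumption rather than as part of the original answer: the scraped columns arrive as text (or factors, depending on settings), so they usually need converting before any arithmetic. The data frame below hand-copies two rows from the output above so the snippet runs without a browser; the names `odds`, `num_cols` and `implied` are illustrative. The last step checks that the Payout column appears consistent with 1 / (1/odds1 + 1/oddsX + 1/odds2), which holds for these two rows but is not documented by the site in this thread.

```r
# Two rows copied from the printed output above, stored as text like a fresh scrape
odds <- data.frame(
  Bookmakers = c("bet-at-home", "bwin"),
  `1` = c("2.25", "2.30"),
  X = c("3.80", "3.75"),
  `2` = c("2.60", "2.70"),
  Payout = c("91.6%", "93.3%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert the three odds columns to numeric (as.character() also covers factors)
num_cols <- c("1", "X", "2")
odds[num_cols] <- lapply(odds[num_cols], function(x) as.numeric(as.character(x)))

# Strip the percent sign and express the payout as a fraction
odds$Payout <- as.numeric(sub("%", "", as.character(odds$Payout), fixed = TRUE)) / 100

# Assumed relationship: payout = 1 / sum of the inverse odds
implied <- 1 / rowSums(1 / odds[num_cols])
print(cbind(odds, implied = implied))
```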
jdharrison
  • Thank you very much! After reading http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html and some trial and error, this worked very well! I'm sure I could have figured this out with the callback requests, but this was obviously much easier for me. – lunatus Jun 21 '14 at 06:11
  • @jdharrison Would you care to explain how you found that the table was `tbls[0].outerHTML`? – MLEN Aug 21 '17 at 09:29
  • `tbls[0].outerHTML` gets the HTML of the first table node. `readHTMLTable` can process HTML representing a table node. – jdharrison Aug 21 '17 at 10:16
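
To illustrate the point in the last comment without a browser, here is a minimal sketch of handing a table's HTML, as a plain string, to the XML package. The HTML string and the bookmaker names are made up for illustration; in the answer above the string would come from `executeScript()`. Parsing with `htmlParse(asText = TRUE)` first is one way to make explicit that the input is markup rather than a file name or URL.

```r
library(XML)

# Hypothetical stand-in for the string returned by executeScript(); not real odds data
tblSource <- paste0(
  "<table>",
  "<tr><th>Bookmakers</th><th>1</th><th>X</th><th>2</th></tr>",
  "<tr><td>BookieA</td><td>2.25</td><td>3.80</td><td>2.60</td></tr>",
  "<tr><td>BookieB</td><td>2.30</td><td>3.75</td><td>2.70</td></tr>",
  "</table>"
)

# Parse the string as HTML, then extract every <table> node as a data frame
doc <- htmlParse(tblSource, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Only one table is present, so take the first element of the list
odds <- tables[[1]]
print(odds)
```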
2

The "bookies" data comes from a request for a javascript callback resource:

GET /x/bookies-140619144601-1403252087.js HTTP/1.1
Host: rb.oddsportal.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:30.0) Gecko/20100101 Firefox/30.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.oddsportal.com//hockey/usa/nhl/carolina-hurricanes-ottawa-senators-80YZhBGC/
Connection: keep-alive

It returns a callback resource that has the bookie info, but no odds. The odds come from other AJAX callback requests, but you'll have to dig to find them.

Burp Proxy is a great way to see the URI calls, but the DOM inspection (as @Spacedman suggested) should always be your first line of investigation.
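
If you go down this route, the rough shape of the R code might look like the sketch below. Everything specific in it is an assumption: the response format is not shown in this thread, so the payload string, the callback name `bookiesCallback`, and the helpers `fetch_callback()` and `parse_callback()` are hypothetical. It assumes the resource is a JSONP-style wrapper around JSON, which may not match what the site actually returns.

```r
library(jsonlite)

# Hypothetical helper: request a callback resource, sending the Referer header
# seen in the captured request. Defined but not called here (needs httr and network access).
fetch_callback <- function(url, referer) {
  resp <- httr::GET(url, httr::add_headers(Referer = referer))
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Hypothetical helper: strip a JSONP-style wrapper such as callbackName({...});
parse_callback <- function(js_text) {
  # Keep everything between the first "(" and the last ")"
  json <- sub("^[^(]*\\((.*)\\)[^)]*$", "\\1", js_text)
  fromJSON(json)
}

# Made-up payload for illustration only; the real format has to be inspected first
js_text <- 'bookiesCallback({"16": {"name": "bet365"}, "2": {"name": "bwin"}});'
bookies <- parse_callback(js_text)
print(names(bookies))
```

The wrapper-stripping regex is deliberately simple; if the real payload is not valid JSON once unwrapped, `fromJSON()` will raise an error and the parsing step would need to be adapted.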

hrbrmstr