I would like to scrape this website.

In particular, I would like to extract the information shown in the table on that page.

Please note that I chose a specific date in the upper right corner.

By following this guide, I wrote the following code:

library(rvest)
url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'

webpage_nba <- read_html(url_nba)

# Use a CSS selector to scrape the rankings section
data_nba <- html_nodes(webpage_nba,'#standings-table')

# Convert the ranking data to text
data_nba <- html_text(data_nba)
write.csv(data_nba,"web scraping test.csv")

As I understand it, the numbers I want (e.g. for the Warriors: 94%, 79%, 66%, 59%) are "coded" in a different way. In other words, what is written to web scraping test.csv is not readable.

Is there any way to transform the "coded numbers" into "regular numbers"?

Cœur
quant
  • First, you can use `html_table(webpage_nba)` to extract a list of all tables from the HTML; that's quite a handy function if you're interested in HTML tables. But are you sure that your code actually extracts the table at all? I doubt it, as I see a lot of JavaScript here, which does not bode well for web scraping; e.g. your selection is not reflected in the HTML source. Have you checked out https://github.com/fivethirtyeight/data/tree/master/nba-elo ? I'm not an NBA kid, but maybe you can find the data there? – friep Jul 26 '17 at 11:58
  • Indeed, `html_table(webpage_nba)` gives me the tables. But then two questions arise: 1) in the columns of the 3rd table (after you run this command) there are `` instead of normal numbers. How could I "translate" them? 2) How could I choose a certain date from the upper right (April 14, before playoffs)? The NBA is just an example to prove my point. – quant Jul 26 '17 at 12:04
  • I see. The normal numbers are not there because it reads the "empty" table, before your selection. Quick googling reveals that the symbol is indeed the check mark from the first rows. I'd try making the selection on your website of choice, then going to the source (right click, "View page source") and doing a quick Ctrl+F for something you can see visually (e.g. 94%). If you can't find it, you can't scrape it easily and you need to look for 'scrape javascript generated data in R' on Google. There's no off-the-shelf solution, I think; you need to do some digging for your specific case and try it out. – friep Jul 26 '17 at 12:12
  • Similar to the last point, you can find the 94% after filtering for your desired date. It's in the same position as the check mark (`webpage_nba %>% html_nodes(".pct.div.break")`). The key is that the URL retrieves the last date and therefore does not correspond to the filtering. There are some related questions that deal with 'search results scraping with R'. – timfaber Jul 26 '17 at 12:38
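
The "is the value in the page source?" check suggested in the comments can also be done from R. The following is only a minimal sketch: the two HTML strings are made-up stand-ins for the static and the browser-rendered source (they are not copied from the real page), and `value_in_source` is a hypothetical helper name.

```r
library(rvest)

# Hypothetical stand-in for the static page source: the cell is empty
# because the real value would be filled in later by JavaScript.
static_html <- '<table id="standings-table"><tr><td class="team">Warriors</td><td class="pct"></td></tr></table>'

# Hypothetical stand-in for the source after a browser has rendered the page.
rendered_html <- '<table id="standings-table"><tr><td class="team">Warriors</td><td class="pct">94%</td></tr></table>'

# TRUE if `needle` appears in the text of any node matching the CSS selector
value_in_source <- function(html, css, needle) {
  cells <- html_text(html_nodes(read_html(html), css))
  any(grepl(needle, cells, fixed = TRUE))
}

in_static   <- value_in_source(static_html, "td.pct", "94%")
in_rendered <- value_in_source(rendered_html, "td.pct", "94%")
```

If the value only shows up in the rendered source, plain `read_html(url)` will not see it and a browser-driven approach is needed.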

2 Answers

I tried to parse the data using rvest, but the challenging part here is clicking the dropdown menu, which is represented by a <select> tag in the HTML. So I brought out the heavy artillery: RSelenium, which automates a real browser. With it everything became easy, thanks to the answer on SO:

library(RSelenium)
library(rvest)

url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'


#initiate RSelenium. If it doesn't work, try other browser engines
rD <- rsDriver(port=4444L,browser="firefox")
remDr <- rD$client

#navigate to main page
remDr$navigate(url_nba)

#find the box and click option 10 (April 14 before playoffs)
webElem <- remDr$findElement(using = 'xpath', value = "//*[@id='forecast-selector']/div[2]/select/option[10]")
webElem$clickElement()

# Save html
webpage <- remDr$getPageSource()[[1]]
# Close RSelenium
remDr$close()
rD[["server"]]$stop() 

# Select one of the tables and convert it to a data frame
webpage_nba <- read_html(webpage) %>% html_table(fill = TRUE)
df <- webpage_nba[[3]]

# Clean the data frame: use row 3 as the header, drop the 3 leading and
# 4 trailing rows, then drop the columns whose header is "NA"
names(df) <- df[3,]
df <- tail(df,-3)
df <- head(df,-4)
df <- df[ , names(df) != "NA"]

df

    ELO Carm-ELO 1-Week Change          Team Conf. Conf. Semis Conf. Finals Finals Win Title
4  1770     1792           -14      Warriors  West         94%          79%    66%       59%
5  1661     1660           -43         Spurs  West         90%          62%    15%       11%
6  1600     1603           +33       Raptors  East         77%          47%    25%        5%
7  1636     1640           +33      Clippers  West         58%          11%     7%        5%
8  1587     1589           -22       Celtics  East         70%          42%    24%        4%
9  1587     1584            -9       Wizards  East         79%          38%    21%        4%
10 1617     1609           +16          Jazz  West         42%           7%     5%        3%
11 1602     1606           -18       Rockets  West         70%          27%     5%        3%
12 1545     1541           -22     Cavaliers  East         59%          27%    11%        2%
13 1519     1523           +25         Bulls  East         30%          15%     7%       <1%
14 1526     1520           +37        Pacers  East         41%          17%     6%       <1%
15 1563     1564            +6 Trail Blazers  West          6%           3%     1%       <1%
16 1543     1537           -20       Thunder  West         30%           8%    <1%       <1%
17 1502     1502            -3         Bucks  East         23%           9%     3%       <1%
18 1479     1469           +46         Hawks  East         21%           6%     2%       <1%
19 1482     1480           -41     Grizzlies  West         10%           3%    <1%       <1%
20 1569     1555           +32          Heat  East           —            —      —         —
21 1552     1533           +27       Nuggets  West           —            —      —         —
22 1482     1489           -12      Pelicans  West           —            —      —         —
23 1463     1472           -18  Timberwolves  West           —            —      —         —
24 1463     1462           -40       Hornets  East           —            —      —         —
25 1441     1436           +22       Pistons  East           —            —      —         —
26 1420     1421           -20     Mavericks  West           —            —      —         —
27 1393     1395            -2         Kings  West           —            —      —         —
28 1374     1379           -13        Knicks  East           —            —      —         —
29 1367     1370           +47        Lakers  West           —            —      —         —
30 1372     1370           -14          Nets  East           —            —      —         —
31 1352     1355            -9         Magic  East           —            —      —         —
32 1338     1348           -29         76ers  East           —            —      —         —
33 1340     1337           +26          Suns  West           —            —      —         —
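
The percentage columns in this output are character strings. If numbers are needed, a conversion along the following lines might work. This is a sketch, not part of the original answer: `pct_to_num` is a hypothetical helper, mapping "<1%" to 0.5 is an arbitrary choice of mine, and the dash used for eliminated teams is turned into `NA`.

```r
# Convert strings such as "94%", "<1%" and a dash to numbers.
# Assumptions: "<1%" is mapped to 0.5 as an arbitrary placeholder, and
# anything that is not a plain percentage (e.g. the dash) becomes NA.
pct_to_num <- function(x) {
  x <- trimws(x)
  out <- rep(NA_real_, length(x))
  below_one <- !is.na(x) & x == "<1%"
  plain <- grepl("^[0-9]+%$", x)
  out[below_one] <- 0.5
  out[plain] <- as.numeric(sub("%", "", x[plain], fixed = TRUE))
  out
}

# Sample values in the style of the "Win Title" column ("\u2014" is the dash)
win_title <- c("59%", "11%", "<1%", "\u2014")
win_title_num <- pct_to_num(win_title)
```

Applied to the data frame above, this would be called once per percentage column, e.g. `df[["Win Title"]] <- pct_to_num(df[["Win Title"]])`.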

If you want to parse other time periods, check the option value in the HTML of the page using the Dev Tools of your browser.
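
Instead of reading the options off the Dev Tools by hand, they can also be listed with rvest from a saved page source. The snippet below is a sketch under assumptions: the HTML is a made-up excerpt that only mimics the `#forecast-selector` structure used in the XPath above, and the option labels and values are invented.

```r
library(rvest)

# Hypothetical excerpt of the page source; the real option labels and values
# must be read from the live page.
selector_html <- '
<div id="forecast-selector">
  <div>Forecast from</div>
  <div>
    <select>
      <option value="a">Today</option>
      <option value="b">Before playoffs</option>
      <option value="c">Preseason</option>
    </select>
  </div>
</div>'

opts <- html_nodes(read_html(selector_html), "#forecast-selector select option")

# One row per option: its 1-based position (the n in option[n]),
# its value attribute, and its visible label
option_index <- data.frame(
  position = seq_along(opts),
  value    = html_attr(opts, "value"),
  label    = html_text(opts, trim = TRUE),
  stringsAsFactors = FALSE
)
option_index
```

The `position` column is the number that goes into `option[...]` in the XPath; with the real page, `selector_html` would be replaced by the string returned from `remDr$getPageSource()[[1]]`.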

Alex Knorre
  • I tried both `browser=firefox` and `browser=chrome`, but in both cases I get an error: `[1] "Connecting to remote server" Error in checkError(res) : Couldnt connect to host on http://localhost:4444/wd/hub. Please ensure a Selenium server is running.` – quant Jul 26 '17 at 13:12
  • @quant Have you properly installed `RSelenium` and all its dependencies? Try to execute `RSelenium::rsDriver()` and `wdman::selenium(port = 4444L)`. Does it work without errors? – Alex Knorre Jul 26 '17 at 14:38
  • `RSelenium::rsDriver()` gives the same error. I also tried re-installing the package and got the same error. – quant Jul 26 '17 at 15:16
  • How would you modify this: `webElem <- remDr$findElement(using = 'xpath', value = "//*[@id='forecast-selector']/div[2]/select/option[10]")`, in order to click twice on the left arrow of the small table at the top of `url_nba`? (The one that says `Games on Jun. 12, 2017`.) – quant Jul 27 '17 at 07:52
  • `webElem <- remDr$findElement(using = 'xpath', value = '//*[(@id = "arrow-left")]')` FYI: you can use [SelectorGadget](http://selectorgadget.com/) to get the XPath of an element and paste it into your code to work with any element of the page. – Alex Knorre Jul 27 '17 at 09:02
  • Yes, I have SelectorGadget :). But I meant, how can you click it twice? – quant Jul 27 '17 at 09:05
  • Ahh, ezpz :) just execute `webElem$clickElement()` two times – Alex Knorre Jul 27 '17 at 09:08
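
The "click it two times" advice from the comments could be wrapped in a small helper. This is a sketch only: `click_n_times` is a hypothetical name, and the pause between clicks is an assumption of mine to give the page time to update, not something the thread specifies.

```r
# Click a web element `n` times, pausing briefly between clicks.
# `elem` is anything with a clickElement() method, e.g. the result of
# remDr$findElement(); the default pause length is an arbitrary assumption.
click_n_times <- function(elem, n, pause = 0.5) {
  for (i in seq_len(n)) {
    elem$clickElement()
    Sys.sleep(pause)
  }
  invisible(elem)
}

# Usage with a live RSelenium session (not run here):
# arrow <- remDr$findElement(using = "xpath", value = '//*[(@id = "arrow-left")]')
# click_n_times(arrow, 2)
```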
Thanks to @Alexey's answer and this, the following code worked for me:

library(RSelenium)
library(rvest)
library(wdman)

url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'


# Start a headless PhantomJS driver through wdman instead of rsDriver()
# rD <- rsDriver()
# remDr <- rD$client

pDrv <- phantomjs(port = 4567L)
remDr <- remoteDriver(browserName = "phantomjs", port = 4567L)
remDr$open()
#navigate to main page
remDr$navigate(url_nba)

#find the box and click option 10 (April 14 before playoffs)
webElem <- remDr$findElement(using = 'xpath', value = "//*[@id='forecast-selector']/div[2]/select/option[10]")
webElem$clickElement()

# Save html
webpage <- remDr$getPageSource()[[1]]
# Close RSelenium
remDr$close()
pDrv$stop()

# rD[["server"]]$stop() 


# Select one of the tables and convert it to a data frame
webpage_nba <- read_html(webpage) %>% html_table(fill = TRUE)
df <- webpage_nba[[3]]

# Clean the data frame: use row 3 as the header, drop the 3 leading and
# 4 trailing rows, then drop the columns whose header is "NA"
names(df) <- df[3,]
df <- tail(df,-3)
df <- head(df,-4)
df <- df[ , names(df) != "NA"]
df
quant