
I'm trying to scrape the first table from this url:

https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal

using the following code:

url <- "https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal"
data <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="top-player-stats-summary-grid"]')

which gives data a value of {xml_nodeset (0)}

url <- "https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal"
data <- url %>%
  read_html() %>%
  html_nodes(css='.grid')

gives the same problem.

Apparently this might be a javascript issue - is there a fast way to extract the relevant data? Inspecting the table entries suggests that the data is not pulled in from elsewhere but is coded into the page itself, so it seems I should be able to extract it from the source code (sorry, I am completely ignorant of how HTML and JS work, so my question might not make sense).

natedjurus

1 Answer


When you view the page in a browser, the content is added dynamically by javascript running on the page. That doesn't happen with rvest, so the table is never present in the html it downloads. You can, however, observe in the dev tools network tab the xhr call which returns this content as json, and make that request yourself:

require(httr)      # GET() and add_headers()
require(jsonlite)  # fromJSON()

# Headers copied from the browser's xhr request (some may be removable)
headers = c('user-agent' = 'Mozilla/5.0',
            'accept' = 'application/json, text/javascript, */*; q=0.01',
            'referer' = 'https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal',
            'authority' = 'www.whoscored.com',
            'x-requested-with' = 'XMLHttpRequest')

# Query-string parameters from the same xhr request (matchId is the id in the page url)
params = list(
  'category' = 'summary',
  'subcategory' = 'all',
  'statsAccumulationType' = '0',
  'isCurrent' = 'true',
  'playerId' = '',
  'teamIds' = '158',
  'matchId' = '318578',
  'stageId' = '',
  'tournamentOptions' = '',
  'sortBy' = '',
  'sortAscending' = '',
  'age' = '',
  'ageComparisonType' = '',
  'appearances' = '',
  'appearancesComparisonType' = '',
  'field' = '',
  'nationality' = '',
  'positionOptions' = '',
  'timeOfTheGameEnd' = '',
  'timeOfTheGameStart' = '',
  'isMinApp' = '',
  'page' = '',
  'includeZeroValues' = '',
  'numberOfPlayersToPick' = ''
)

r <- httr::GET(url = 'https://www.whoscored.com/StatisticsFeed/1/GetMatchCentrePlayerStatistics',
               httr::add_headers(.headers = headers),
               query = params)

data <- jsonlite::fromJSON(content(r, as = "text"))
print(data$playerTableStats)

A small sample of the contents of data$playerTableStats can be seen via View(data$playerTableStats). You would then parse it as required for the info you want, in the format you want.
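For example, a minimal sketch of pulling out just the player ratings; the column names "name" and "rating" are assumptions about the json fields, so check names(data$playerTableStats) for what is actually returned:

# Assumes the GET above succeeded and returned json as expected;
# "name" and "rating" are guessed column names - adjust after
# inspecting names(data$playerTableStats)
stats <- data$playerTableStats
ratings <- stats[, c("name", "rating")]
print(ratings)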


QHarr
  • Can you please explain how you defined the parameters and values inside **headers** and **params**? Can this approach be used as an alternative to `selenium` for sites that load their data server-side? – maydin Aug 18 '19 at 20:12
  • It depends on how the page updates content (if it updates dynamically). You can view the request by opening dev tools with F12 and going to the network tab; press F5 to refresh the page, then Ctrl + F to open the find box, type Robinson and press enter. You will find the request shown there along with its params and headers. I ran it with python to test removing some of the headers; you can probably remove more. See [here](https://stackoverflow.com/a/56279841/6241235) and [here](https://stackoverflow.com/a/56924071/6241235) – QHarr Aug 18 '19 at 20:16
  • A few seconds after reading your comment I took *Robinson* to be some kind of javascript function :))). Then I realized it is the name of a player. Ok, thank you, I got the point... I will make some trials with this method. – maydin Aug 18 '19 at 20:30
  • Thanks! I think I'm starting to understand now. Weirdly, the first time I ran the code you posted it worked and I was able to store the data, but the second time I got this error: `data <- jsonlite::fromJSON(content(r,as="text") )` `No encoding supplied: defaulting to UTF-8. Error: lexical error: invalid char in json text.` – natedjurus Aug 18 '19 at 22:29
  • I think the page probably has anti-scraping measures, so from time to time you get a different result, which is html. If that is the case then browser automation is the next thing to try, e.g. RSelenium – QHarr Aug 18 '19 at 22:31
  • Thanks - yes it seems as if I am blocked from reading the JSON. Do you have any recommendations or know any tutorials on how to get started with using RSelenium for this purpose? I've googled a bit but cannot find anything specific on how to use it to get info from javascript tables (I really just need the ratings). Thanks! – natedjurus Aug 19 '19 at 12:00
  • By using RSelenium you automate a browser, so the javascript will run. There is nothing else special to do apart from potentially using a wait condition for the presence of the table element. – QHarr Aug 19 '19 at 12:34
  • https://ropensci.org/tutorials/rselenium_tutorial/ this looks alright. You can then use remDr$findElement(using = 'css selector', "selector_for_table") – QHarr Aug 19 '19 at 12:35
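Along the lines of that tutorial, a minimal RSelenium sketch might look like the following. It assumes a working local selenium driver; the browser, port and fixed Sys.sleep() wait are arbitrary choices, and the selector is the id from the question, which may need adjusting in dev tools if it points at a wrapper rather than the table itself.

library(RSelenium)
library(rvest)

# Start a browser session (assumes a local selenium driver is available;
# browser and port are arbitrary)
driver <- rsDriver(browser = "firefox", port = 4545L)
remDr <- driver$client

remDr$navigate("https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal")

# Crude fixed wait for the javascript to render the table;
# polling remDr$findElement() in a loop is more robust
Sys.sleep(5)

# Hand the rendered page source back to rvest and parse the table
page <- read_html(remDr$getPageSource()[[1]])
tbl <- page %>% html_node(css = "#top-player-stats-summary-grid") %>% html_table()
print(tbl)

remDr$close()
driver$server$stop()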