Consider the URL https://masseyratings.com/cb/ncaa-d1/ratings

If one clicks on "More" and chooses "Export", a CSV file of the ratings is downloaded.

How would I use rvest, httr, etc. to download this file directly from R? (Ideally I would even skip the step of saving the file and just convert the CSV to a data frame right away, but I would be satisfied either way.) I have tried to trace what is happening using the developer tools in Chrome and Firefox, but none of the examples with which I am familiar seem to apply to whatever is happening here.

Obviously it's not too difficult to just download the file and read it into R, but I would really like to automate the process.

The HTML code for the page includes this:

<select class='mopulldown' id='pulldownlinks'>
  <option value=''>More
  <option value='cb/ncaa-d1/ratings?c=1'>Conferences
  <option value='/map.php?s=379387&t=11590'>Map
  <option value='/scores.php?s=cb2022&sub=11590'>Scores/Schedule Data
  <option value='/cb2021/ncaa-d1/ratings'>cb2021
  <option value='/team.php?t=11590&s=cb2022&all=1'>Rating Archive
  <option value='/scoredist?s=cb2022&sub=11590&x=s'>Score Distribution
  <option value='/extgms?s=cb2022&sub=11590'>Extreme Games
  <option value='/path?s=cb2022'>Transitive Path
  <option value='exportCSV'>Export
</select>

and it's the last selection that triggers the download of the CSV file.

CS Fuu
  • Welcome to SO, CS Fuu! StackOverflow is not suited (nor intended) to be a tutorial site. There are several tutorials and vignettes for `rvest`, I suggest you start at its [website](https://rvest.tidyverse.org/) and see what you can come up with. In general, it is expected that you do your research and attempts *before* asking questions here, and when you do so, you include code you've tried (in a reproducible manner). See https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Mar 11 '22 at 02:17
  • I have used rvest, httr, etc for years to do web scraping and I have looked through numerous examples and tutorials. But I just don't see how to click that button in R (there doesn't seem to be any obvious POST command that would do it). – CS Fuu Mar 11 '22 at 15:03

1 Answer

When I attempt to scrape something like that, I usually open the browser's devtools (often F12) and watch the network traffic when I click the button; often it points to a GET or POST request that returns JSON containing the data I want. Calling that GET/POST URL directly often removes the need to do any HTML manipulation at all.

In this case, clicking "More" or "Export" triggers no new request; the data has already been loaded from a plain URL:

url <- "https://masseyratings.com/json/rate.php?argv=kiqB7tdov4KNhxOtPC9JHgV-OZNKvSJFAtoC3YxpTt4s72nWxxwgp35IAExoj-CvP3XmvNm8l6ksrUVUer342g..&task=json"
res <- httr::GET(url)

Confirm status 200:

res
# Response [https://masseyratings.com/json/rate.php?argv=kiqB7tdov4KNhxOtPC9JHgV-OZNKvSJFAtoC3YxpTt4s72nWxxwgp35IAExoj-CvP3XmvNm8l6ksrUVUer342g..&task=json]
#   Date: 2022-03-11 15:08
#   Status: 200
#   Content-Type: application/json
#   Size: 86.3 kB

Look at the data:

dat <- httr::content(res)
str(dat, max.level=1)
# List of 11
#  $ TI          :List of 5
#  $ CI          :List of 20
#  $ RI          : list()
#  $ DI          :List of 358
#  $ timestamp   : num 1.65e+12
#  $ rating      :List of 4
#  $ prevnextpage: int 0
#  $ seas        : chr "cb2022"
#  $ soid        : int 379387
#  $ suboid      : int 11590
#  $ subname     : chr " : NCAA D1"

From here, there is likely a lot that needs to be done to convert that to a data.frame. (FYI, the exported data is imperfect, as it has missing column names and looks as if it does not contain all of the data within dat.)
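
As a starting point for that conversion, here is a minimal sketch. It assumes (this is not verified against the site) that each element of `dat$DI` is a list of cells, where a cell may be a scalar, `NULL`, or a small nested list. The helper name `records_to_df` and the toy data are hypothetical and only illustrate the idea; the real column meanings would still have to be worked out, for example from `dat$CI`.

```r
# Hypothetical helper: bind a list of records into a data.frame.
# Assumption: each record is a list of cells; a cell is a scalar,
# NULL, or a small nested list (collapsed to one "|"-separated string).
records_to_df <- function(records) {
  if (length(records) == 0) return(data.frame())
  flat <- lapply(records, function(rec) {
    vapply(rec, function(cell) {
      vals <- unlist(cell)
      # NULL or empty cells become NA so columns stay aligned.
      if (length(vals) == 0) NA_character_ else paste(vals, collapse = "|")
    }, character(1))
  })
  width <- max(lengths(flat))
  # Pad short records with NA so every row has the same width.
  padded <- lapply(flat, function(v) c(v, rep(NA_character_, width - length(v))))
  df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(df) <- paste0("V", seq_len(width))
  # Let R guess integer/numeric/logical column types.
  df[] <- lapply(df, type.convert, as.is = TRUE)
  df
}

# Toy data standing in for dat$DI (made up for illustration only).
toy <- list(
  list(list("Team A", "/team-a"), 1L, 9.87),
  list(list("Team B", "/team-b"), 2L, NULL)
)
df <- records_to_df(toy)
str(df)
```

If the assumption about the structure holds, `records_to_df(dat$DI)` should give one row per team with generic column names `V1`, `V2`, and so on, which can then be renamed once the columns are identified.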

r2evans
  • Hmmm, I think what you're getting is the web page, which you would also get with the URL that I gave in the original question. I know that I can clean this up to get what I want, but what I am hoping to do is bypass all that, since the csv that the manual download provides already gives me a clean version of what I want. Maybe it's possible to provide further arguments to the URL that you use, but as you say, the network traffic from devtools shows nothing (I did check all that before I asked the original question). – CS Fuu Mar 11 '22 at 19:39
  • I added some relevant bits from the page source to the original question. And thanks for looking at this. – CS Fuu Mar 11 '22 at 19:46
  • After looking at the output a bit more, I think I will go ahead and write a little code to manipulate the results into a data frame, which will be a pretty simple task. I guess the question of how to trigger the web site to provide the csv directly will remain a mystery to me. Thanks again. – CS Fuu Mar 11 '22 at 22:55
  • Nope, you're right, I thought the DI had the data that I need (CI has information about the columns), but it doesn't. In fact, I have no idea what some of those numbers represent. Maybe someone else will come along and suss out how to trigger the CSV download. – CS Fuu Mar 12 '22 at 01:43
  • I think that HTML-`form` is completely javascript-managed, which means you may be able to use `RSelenium` to get what you need ... though it's not 100% perfect/deterministic in its behavior (since it relies on load times, etc). Sorry, not much better to work with here, I dislike non-standard web-pages that try to use js-trickery to be "cool". – r2evans Mar 12 '22 at 01:55