
I am doing a project where I need to download FAFSA completion data from this website: https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school

I am using rvest to scrape that data, but when I try to use `read_html` on the link, it never finishes and eventually I have to stop execution. I can read other websites, so I'm not sure whether it's a website-specific issue or whether I'm doing something wrong. Here is my code so far:

library(rvest)

fafsa_link <- "https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school"

read_html(fafsa_link)

Any help would be greatly appreciated! Thank you!


1 Answer


A user-agent header is required. The download links are also given in a JSON file. You could regex out all the links (or parse them properly); or, as I do here, regex out one link and then substitute the state code within it to build the other download URL (the URLs differ only in the state code).

library(magrittr)
library(httr)
library(stringr)

# The request hangs without a User-Agent header
data <- httr::GET(
  'https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school.json',
  add_headers("User-Agent" = "Mozilla/5.0")
) %>%
  content(as = "text")

# Extract the California download path from the JSON text and build the full URL
ca <- data %>%
  stringr::str_match(': "(.*?CA\\.xls)"') %>%
  .[2] %>%
  paste0('https://studentaid.gov', .)

# The URLs vary only in the state code, so substitute to get Massachusetts
ma <- gsub('CA\\.xls', 'MA.xls', ca)
  • The above gets you the links to the files (`ca` and `ma`). You can then read into df e.g. https://stackoverflow.com/questions/41368628/read-excel-file-from-a-url-using-the-readxl-package – QHarr Apr 23 '21 at 20:34
  • Okay thank you! I tried running that code and then doing `GET(ca, write_disk(tf <- tempfile(fileext = ".xls"))) ` and then `ca <- read_excel(tf)` And I am experiencing a similar time-out issue as before. – lizronan22 Apr 26 '21 at 14:03
  • I suspect needs user-agent header in request for file – QHarr Apr 26 '21 at 15:39
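Following up on the last comment: the request for the file itself also needs the user-agent header, otherwise it times out the same way. A minimal sketch, assuming `ca` holds the URL built in the answer above (the sheet layout of the workbook is not checked here, so inspect the file before relying on `read_excel`'s defaults):

```r
library(httr)
library(readxl)

# Download the .xls to a temp file, sending the same User-Agent header
tf <- tempfile(fileext = ".xls")
httr::GET(ca,
          add_headers("User-Agent" = "Mozilla/5.0"),
          write_disk(tf, overwrite = TRUE))

# Read the downloaded workbook into a data frame
ca_df <- readxl::read_excel(tf)
```

The same pattern works for `ma` or any other state URL produced by the substitution step.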