Getting '404 Not Found' error when reading data from url, despite file existing

Question

I am writing a program to collect all of the daily .csv files from this page. However, for some of the files, I get the error message:

Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
  cannot open URL 'https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05042016_DailyAbsenceData.csv': HTTP status was '404 Not Found'

Here is an example from the May 12, 2016 file:

read.csv(url("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05122016_DailyAbsenceData.csv"))

The bizarre thing is, if you go to the website, find the link to that file and click it, R no longer gives the error and reads the file correctly. What is going on here and how can I read those files without having to click them manually? (Note, only the first one of you is going to be able to replicate the problem, because clicking the file fixes it for the rest.)

Ultimately, I want to use the following loop to collect all the files:

# Create a vector of dates covering the interval the data is collected from.
dates = seq(as.Date("2016-05-01"), as.Date("2016-05-30"), by="days")
# Format to match the filename prefixes
dates = strftime(dates, '%m%d%Y')
# Create the vector of file names I want to read.
file.names = paste(dates,"_DailyAbsenceData.csv", sep = "")

# A loop that reads the .csv files into a list of data frames
daily.truancy = list()
for (i in seq_along(dates)) {
  tryCatch({ # prevents the loop from stopping when read.csv cannot access the file
    daily.truancy[[i]] = read.csv(url(paste("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/", file.names[i], sep = "")), sep = ",")
    message("School day: ", file.names[i]) # the file was successfully read into the list
  }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}

# Unlist the daily data to a large panel
daily.truancy.2016 <- do.call("rbind", daily.truancy)

Note that the same error message is given for days when there is, in fact, no file (weekends). This is not a problem.
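
Since weekends never have a file, the candidate URLs can be narrowed to weekdays before any request is made. A minimal sketch, assuming the same base URL and file-name pattern as above; `build_daily_urls` is a hypothetical helper name, and holidays are not handled:

```r
# Hypothetical helper: build the daily file URLs for weekdays only.
build_daily_urls <- function(from, to,
                             base = "https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/") {
  dates <- seq(as.Date(from), as.Date(to), by = "days")
  # as.POSIXlt()$wday is 0 for Sunday and 6 for Saturday, independent of locale
  wday <- as.POSIXlt(dates)$wday
  weekdays_only <- dates[!(wday %in% c(0, 6))]
  paste0(base, format(weekdays_only, "%m%d%Y"), "_DailyAbsenceData.csv")
}

urls <- build_daily_urls("2016-05-01", "2016-05-30")
length(urls)  # number of weekday URLs in the interval
head(urls, 1)
```

This only trims requests that are known to fail; it does not address the 404 on files that have not yet been clicked.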

I don't get that error for the file you listed. If you are getting a 404 return code from a website, you should contact that website to find out why. There's nothing R can do to make a file that doesn't exist suddenly appear. These files may be generated on demand. — MrFlick, Apr 05 '17 at 04:45
They may be blocking your client specifically to prevent automatic downloads of the file. — Burhan Khalid, Apr 05 '17 at 04:50
MrFlick, if the files are created on demand, is there a way to trigger the generating process in any other way except by clicking the icon in the web page? Is there a way to trigger it in R? — RobustTurd, Apr 05 '17 at 05:33
MrFlick, the files DO exist, at least as clickable links. That's the whole point of the post. — RobustTurd, Apr 05 '17 at 05:44

score 1 · Answer 1 · edited May 23 '17 at 12:25

Since the pages are dynamically generated, the url function will not work here, but RSelenium was expressly designed for such tasks.

I want to thank @jdharrison for this superb package as well as his answers to challenging questions, see his answers page for more examples.

Basic setup procedure is explained here: RSelenium Setup

The easiest way to find the ID of the element we are interested in is to right-click the element and click "Inspect" in Chrome. I am not sure about other browsers, but they should have similar functionality, possibly under different names.

This will open a side window containing the HTML tags for the selected element.

library(RSelenium)
RSelenium:::startServer()

#you can replace browser name with your version e.g. firefox

remDr <- remoteDriver(browserName = "chrome")
remDr$open(silent = TRUE)

appURL <- 'https://www.eride.ri.gov/eride2K5/AggregateAttendance/AttendanceReports.aspx'


monthYearCounter = 1

#total months to download
totalMonths = 2 

remDr$navigate(appURL)


for(monthYearCounter in 1:totalMonths) {

  #Active monthYear on the page e.g. April 2017
  monthYearElem = remDr$findElement("xpath", "//td[contains(@style,'width:70%')]")

  #highlights the element in yellow for visual feedback
  monthYearElem$highlightElement()

  #extract text
  monthYearText = unlist(monthYearElem$getElementAttribute("innerHTML"))

  cat(paste0("Processing month year=",monthYearText,"\n"))

  #For a particular month, all the CSV files are listed in a table

  #find the elements for all CSV files using the id pattern "imgBtnXls"
  csvFilesElemList = remDr$findElements("xpath", "//input[contains(@id,'imgBtnXls')]")

  #Click each element so the file is saved to the default download location
  #Pause between consecutive requests to avoid burdening the server
  lapply(csvFilesElemList,function(x) {

    #click the download button for this file
    x$clickElement()

    #Be nice, do not overload the server with rapid requests!!
    Sys.sleep(60)

  })

  #Go to previous month
  remDr$findElement("xpath", "//a[contains(@title,'Go to the previous month')]")$clickElement()

}
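
Once the clicks above have saved the files, they still need to be combined into one panel, as the question asks. A minimal sketch, assuming the downloads keep the `<date>_DailyAbsenceData.csv` names and all land in a single directory; `read_daily_files` and the directory path are hypothetical:

```r
# Hypothetical helper: read every downloaded daily file in `dir` and stack them.
read_daily_files <- function(dir) {
  paths <- list.files(dir, pattern = "_DailyAbsenceData\\.csv$", full.names = TRUE)
  frames <- lapply(paths, function(p) {
    df <- read.csv(p, stringsAsFactors = FALSE)
    # Keep the date from the file-name prefix (e.g. "05122016") so rows stay traceable
    file_date <- as.Date(substr(basename(p), 1, 8), format = "%m%d%Y")
    df$file_date <- rep(file_date, nrow(df))
    df
  })
  # rbind requires that all files share the same columns
  do.call(rbind, frames)
}

# Usage, with an assumed download location:
# daily.truancy.2016 <- read_daily_files("~/Downloads")
```

If the browser renames duplicates (for example by appending "(1)"), the pattern still matches only names ending in `_DailyAbsenceData.csv`, so such copies are skipped.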

Thanks for the answer. I get the following error with your code: `> RSelenium:::startServer() Error: startServer is now defunct. Users in future can find the function in file.path(find.package("RSelenium"), "examples/serverUtils"). The recommended way to run a selenium server is via Docker. Alternatively see the RSelenium::rsDriver function.)` — RobustTurd, Apr 05 '17 at 22:43
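
The defunct message itself points at `RSelenium::rsDriver`. A minimal sketch of replacing the `startServer()` and `remoteDriver()` lines with it, assuming a local Chrome install and that `rsDriver` can fetch a matching driver; `start_session` is a hypothetical wrapper name:

```r
# Hypothetical wrapper: start a Selenium server and a browser session via rsDriver().
# Assumes the RSelenium package is installed; nothing is launched until it is called.
start_session <- function(browser = "chrome", port = 4567L) {
  rD <- RSelenium::rsDriver(browser = browser, port = port, verbose = FALSE)
  # rsDriver() returns a list holding the server handle and an already opened client
  list(server = rD$server, client = rD$client)
}

# Usage (not run here because it launches a real browser):
# session <- start_session()
# remDr <- session$client
# remDr$navigate(appURL)
# ... rest of the loop from the answer ...
# remDr$close()
# session$server$stop()
```

The client returned by `rsDriver` is already open, so the separate `remDr$open()` call is not needed in this variant.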
