I am trying an efficient means of reading in select data from very large csv files using the method described here by @JD_Long. The method relies on sqldf() a function from a library by the same name. The problem I am having is that the function returns an empty dataframe, that is a dataframe with the correct number of columns and their names but with zero rows.
I've truncated the formatting function that's intended to process the dataframe once it's been read into the working environment by sqldf().
sum_stats_df <- function(f_name){
# read in data
f <- file(f_name, open = "r")
df <- sqldf("select * from f limit 6", dbname = tempfile(),
file.format = list(header = T, row.names = F))
close(f, type='r')
return(df)
The csv datasets that I'm working with are the well-known Bureau of Transportation Statistics flights data which I downloaded from the ASA website here.
I tried the sample code from the StackOverflow question linked above.
bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)
library(sqldf)
f <- file("bigdf.csv")
bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F))
It worked just fine when I called sum_stats_df("bigdf.csv")
. I tried looking at the properties for the simulation file and the flights csv files using properties in the windows file explorer but they look identical informat. The only difference is the file size. The flights files are much larger, but I don't see how this would matter.
The next thing I tried is loading twenty lines of one of the flights csv files using read.csv(..., nrow=20)
, and then took the resulting dataframe and wrote it back to a csv file. When I tried the problematic function on that new csv test file, it worked.
> print(head(sum_stats_df("test.csv")))
X Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier
1 "1" 1987 10 14 3 741 730 912 849 "PS"
2 "2" 1987 10 15 4 729 730 903 849 "PS"
3 "3" 1987 10 17 6 741 730 918 849 "PS"
4 "4" 1987 10 18 7 729 730 847 849 "PS"
5 "5" 1987 10 19 1 749 730 922 849 "PS"
6 "6" 1987 10 21 3 728 730 848 849 "PS"
But, again, a call using the original csv file returns an empty dataframe.
> print(head(sum_stats_df("2000.csv")))
[1] Year Month DayofMonth DayOfWeek DepTime
[6] CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum
[11] TailNum ActualElapsedTime CRSElapsedTime AirTime ArrDelay
[16] DepDelay Origin Dest Distance TaxiIn
[21] TaxiOut Cancelled CancellationCode Diverted CarrierDelay
[26] WeatherDelay NASDelay SecurityDelay LateAircraftDelay
<0 rows> (or 0-length row.names)
So I'm wondering what could it be about the csv files that I downloaded from ASA that is different from those that I write locally using write.csv()
? What might be other causes for sqldf()
returning empty dataframes, but getting all the column names?