
I am trying to implement an efficient means of reading select data from very large csv files, using the method described here by @JD_Long. The method relies on sqldf(), a function from the library of the same name. The problem I am having is that the function returns an empty dataframe: one with the correct number of columns and their names, but with zero rows.

I've truncated the formatting function that's intended to process the dataframe once it's been read into the working environment by sqldf().

sum_stats_df <- function(f_name){
  # read in data
  f <- file(f_name, open = "r")
  df <- sqldf("select * from f limit 6", dbname = tempfile(),
              file.format = list(header = T, row.names = F))
  close(f)
  return(df)
}

The csv datasets that I'm working with are the well-known Bureau of Transportation Statistics flights data which I downloaded from the ASA website here.

I tried the sample code from the StackOverflow question linked above.

bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)
library(sqldf)
f <- file("bigdf.csv")
bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F))

It worked just fine when I called sum_stats_df("bigdf.csv"). I compared the properties of the simulated file and the flights csv files in the Windows file explorer, and they look identical in format. The only difference is the file size: the flights files are much larger, but I don't see how this would matter.

The next thing I tried was loading twenty lines of one of the flights csv files using read.csv(..., nrows=20), then writing the resulting dataframe back to a csv file. When I tried the problematic function on that new test file, it worked.
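A minimal sketch of that round trip (the file names are assumptions matching my setup; adjust as needed):

```r
# Read the first 20 rows of the downloaded file with base R,
# then write them back out as a plain csv for testing.
df20 <- read.csv("2000.csv", nrows = 20)
write.csv(df20, "test.csv", quote = FALSE, row.names = FALSE)
```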

> print(head(sum_stats_df("test.csv")))
    X Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier
1 "1" 1987    10         14         3     741        730     912        849          "PS"
2 "2" 1987    10         15         4     729        730     903        849          "PS"
3 "3" 1987    10         17         6     741        730     918        849          "PS"
4 "4" 1987    10         18         7     729        730     847        849          "PS"
5 "5" 1987    10         19         1     749        730     922        849          "PS"
6 "6" 1987    10         21         3     728        730     848        849          "PS"

But, again, a call using the original csv file returns an empty dataframe.

> print(head(sum_stats_df("2000.csv")))
 [1] Year              Month             DayofMonth        DayOfWeek         DepTime          
 [6] CRSDepTime        ArrTime           CRSArrTime        UniqueCarrier     FlightNum        
[11] TailNum           ActualElapsedTime CRSElapsedTime    AirTime           ArrDelay         
[16] DepDelay          Origin            Dest              Distance          TaxiIn           
[21] TaxiOut           Cancelled         CancellationCode  Diverted          CarrierDelay     
[26] WeatherDelay      NASDelay          SecurityDelay     LateAircraftDelay
<0 rows> (or 0-length row.names)

So I'm wondering: what could be different about the csv files I downloaded from ASA compared to those I write locally using write.csv()? What else might cause sqldf() to return an empty dataframe while still picking up all the column names?
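One quick diagnostic is to count how many lines base R itself sees in each file; if readLines() reports far fewer lines than the file should have, the row terminators are a likely suspect. A small sketch (count_lines is an illustrative helper, not from the original post):

```r
# Count the lines base R sees; a file whose rows aren't terminated
# the way readLines() expects will report far fewer lines than it has.
count_lines <- function(f_name) {
  length(readLines(f_name, warn = FALSE))
}
```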

Conner M.
  • One problem I see is that you are using `LIMIT` without `ORDER BY`. This doesn't make sense because you aren't telling the database _how_ you want to limit the result set. _Which_ 6 records do you want? – Tim Biegeleisen Jan 09 '18 at 05:06
  • As far as I know, and I might be wrong, using limit this way would be analogous to getting just the first six rows. – Conner M. Jan 09 '18 at 05:11
  • But I've tried it without limit and get the same empty dataframe. – Conner M. Jan 09 '18 at 05:11
  • 1
    Please read [this accepted answer](https://stackoverflow.com/questions/12626637/reading-a-text-file-in-r-line-by-line) and see if you can read from the `file` as you have opened it in your function. – Tim Biegeleisen Jan 09 '18 at 05:17
  • Added `l <- readLines(con=f, n=6); print(l)` and got the header and five full lines from the file. I've read many answers here saying that the sql objects are deleted immediately after the call, but it doesn't explain why the results are first assigned to `df` nor why it works with locally written csv files. – Conner M. Jan 09 '18 at 05:28
  • I upvote this, but I don't know the API well enough to be of further help. Yes, once you assign to the data frame it should stick. Maybe try inlining the entire code from the function. – Tim Biegeleisen Jan 09 '18 at 05:30

2 Answers


SQLite is fussy about end-of-line characters, and these can differ from your OS's default if the file comes from somewhere else.
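A quick way to check which terminators a file actually uses is to inspect its raw bytes (a diagnostic sketch; has_crlf is an illustrative name, not part of the original answer):

```r
# CRLF-terminated files contain the CR byte 0x0d; LF-only files don't.
has_crlf <- function(f_name) {
  bytes <- readBin(f_name, what = "raw", n = 4096)
  any(bytes == as.raw(0x0d))
}
```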

On Windows and on Ubuntu Linux this worked for me, assuming you have already unzipped the file:

read.csv.sql("1987.csv", "select * from file limit 3", eol = "\n")

or on Windows this variation using the original bz2 file and assuming that the 7z executable is on your path:

read.csv.sql("1987.csv", "select * from file limit 3", eol = "\n",
    filter = "7z x -so 1987.csv.bz2 2>NUL")

On Linux we can use bzip2 instead:

read.csv.sql("1987.csv", "select * from file limit 3", eol = "\n",
    filter = "bzip2 -cd 1987.csv.bz2")

FAQ 16 on the sqldf github page has an example that is very similar to this one.

G. Grothendieck

The files you can download from the quoted web page are bzip2-compressed. I can reproduce your problem when using a compressed file directly. One possible solution is to decompress the file(s) before passing them to sqldf. This is unfortunate, since other tools for reading csv files support bzip2-compressed files directly. I do not know how to do this with sqldf() directly, but read.csv.sql() from the same package supports a filter option, making the following function possible:

library(sqldf)

sum_stats_df <- function(f_name){
  read.csv.sql(f_name, sql = "select * from file limit 6", filter = "bunzip2")
}
Ralf Stubner
  • Thanks for your input, but I unzipped the files before trying to load them. I wonder, though, if there is some processing or formatting going on with the unzipper that might be behind the problem. – Conner M. Jan 09 '18 at 16:11
  • Which tool on which OS did you use to unzip the files? What file sizes do you currently have? For comparison, the data from 2008 is 109 MB with and 658 MB without compression. – Ralf Stubner Jan 09 '18 at 16:20
  • Opening the locally created csv files and the downloaded and unzipped files in Excel doesn't reveal any difference between the two. But opening them in Notepad++ showed quotes around values in some columns of the former. With this in mind I tried defining the class for each column using the method arg. Still didn't work. – Conner M. Jan 09 '18 at 16:42
  • I used 7-Zip, and the unzipped 2008 csv file is 673 MB in my case. – Conner M. Jan 09 '18 at 16:44
  • This is very odd. Can you add `print(f)` to your function after opening the file and post the result when calling the function on one of the files? – Ralf Stubner Jan 09 '18 at 20:52