7

I am reading from a huge text file that has '%d/%m/%Y' date format. I want to use read.csv.sql of sqldf to read and filter the data by date at the same time. This is to save memory usage and run time by skipping many dates that I am not interested in. I know how to do this with the help of dplyr and lubridate, but I just want to try with sqldf for the aforementioned reason. Even though I am quite familiar with SQL syntax, it still gets me most of the time, no exception with sqldf.

Running command like following returned a data.frame with 0 row:

first_date <- "2001-11-1"
second_date <- "2003-11-1"
query <- "select * from file WHERE strftime('%d/%m/%Y', Date, 'unixepoch', 'localtime') between
                    '$first_date' AND '$second_date'"
df <- read.csv.sql(data_file,
                       sql= query,
                       stringsAsFactors=FALSE,
                       sep = ";", header = TRUE)

So for simulation, I tried with sqldf function like the following:

first_date <- "2001-11-1"
second_date <- "2003-11-1"
df2 <- data.frame( Date = paste(rep(1:3, each = 4), 11:12, 2001:2012, sep = "/"))
sqldf("SELECT * FROM df2 WHERE strftime('%d/%m/%Y', Date, 'unixepoch') BETWEEN '$first-date' AND '$second_date' ") 

# Expect:
# Date
# 1  1-11-2001
# 2  1-12-2002
# 3  1-11-2003
biocyberman
  • 5,675
  • 8
  • 38
  • 50
  • Are you sure the format in the file is not dd-mm-yyyy? Typically if day or month is 1, say, then its expressed as 01 and not 1. – G. Grothendieck Sep 07 '14 at 20:07
  • Data in the text file really has dates like this `1/1/2011`, not `01/01/2011`. And either case, I think `strftime` function of `SQLite` driver would be able to convert to the later format. My difficulty here is that I don't know how to test the `strftime` function in this case. – biocyberman Sep 07 '14 at 20:16
  • SQLite doesn't have a date type. You might find useful this: http://stackoverflow.com/questions/4428795/sqlite-convert-string-to-date – nicola Sep 07 '14 at 20:30
  • @nicola That's true. That's why I use `strftime` function to convert time strings to dates on-the-fly and compare on this resulting dates. I am just not familiar with the the function to make correct call. – biocyberman Sep 07 '14 at 21:38

1 Answers1

7

strftime strftime with percent codes is used to convert an object already regarded by sqlite as a datetime to something else but you want the reverse so the approach in the question is not going to work. For example, here we convert the current time into a dd-mm-yyyy string:

library(sqldf)
sqldf("select strftime('%d-%m-%Y', 'now') now")
##          now
## 1 07-09-2014

Discussion Since SQlite lacks date types its a bit onerous to handle this, particularly with the 1-or-2-digit non-standard date formats, but if you really want to use SQLite we can do it by tediously parsing out the date strings. Using fn$ from the gsubfn package for string interpolation eases this a little.

Code Below zero2d outputs SQL code to prepend a zero character to its input if its one digit. rmSlash outputs SQL code to remove any slashes in its argument. Year, Month and Day each output SQL code to take a character string representing a date in the format under discussion and extract the indicated component reformatting it as a 2 digit zero filled character string in the case of Month and Day. fmtDate takes a character string of the form shown in the question for first_string and second_string and outputs a yyyy-mm-dd character string.

library(sqldf)
library(gsubfn)

zero2d <- function(x) sprintf("substr('0' || %s, -2)", x)

rmSlash <- function(x) sprintf("replace(%s, '/', '')", x)

Year <- function(x) sprintf("substr(%s, -4)", x)

Month <- function(x) {
   y <- sprintf("substr(%s, instr(%s, '/') + 1, 2)", x, x)
   zero2d(rmSlash(y))
}

Day <- function(x) {
   y <- sprintf("substr(%s, 1, 2)", x)
   zero2d(rmSlash(y))
}

fmtDate <- function(x) format(as.Date(x))

sql <- "select * from df2 where 
  `Year('Date')` || '-' || 
  `Month('Date')` || '-' || 
  `Day('Date')`
  between '`fmtDate(first_date)`' and '`fmtDate(second_date)`'"
fn$sqldf(sql)

giving:

       Date
1 1/11/2001
2 1/12/2002
3 1/11/2003

Notes

1) SQLite functions used instr, replace and substr are core sqlite functions

2) SQL The actual SQL statement that is executed after fn$ performs the substitutions is as follows (slightly reformatted to fit):

> cat( fn$identity(sql), "\n")
select * from df2 where 
  substr(Date, -4) 
  || '-' || 
  substr('0' || replace(substr(Date, instr(Date, '/') + 1, 2), '/', ''), -2) 
  || '-' || 
  substr('0' || replace(substr(Date, 1, 2), '/', ''), -2)
  between '2001-11-01' and '2003-11-01' 

3) source of complications the main complication is the non-standard 1-or-2 digit day and month. Had they been consistently 2 digits it would have reduced to this:

first_date <- "2001-11-01"
second_date <- ""2003-11-01"

fn$sqldf("select Date from df2 
   where substr(Date, -4) || '-' || 
         substr(Date, 4, 2) || '-' ||
         substr(Date, 1, 2)
   between '`first_date`' and '`second_date`' ")

4) H2 Here is an H2 solution. H2 does have a datetime type simplifying the solution substantially over SQLite. We assume that the data is in a file called mydata.dat. Note that read.csv.sql does not support H2 as H2 already has the internal csvread SQL function to do that:

library(RH2)
library(sqldf)

first_date <- "2001-11-01"
second_date <- "2003-11-01"

fn$sqldf(c("CREATE TABLE t(DATE TIMESTAMP) AS
  SELECT parsedatetime(DATE, 'd/M/y') as DATE
  FROM CSVREAD('mydata.dat')", 
  "SELECT DATE FROM t WHERE DATE between '`first_date`' and '`second_date`'"))

Note that the first RH2 query will be slow in a session since it loads java. After that you can try it out to see if the performance is adequate.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • This is quite a laborious solution. Thanks a lot for work it out for me. – biocyberman Sep 08 '14 at 16:51
  • 1
    Have added a note at the end pointing out that the complication is mostly from the 1-or-2 digit day and month and had they been consistently 2 digits the solution reduces substantially. – G. Grothendieck Sep 08 '14 at 17:10
  • You are absolutely right. I was thinking about using `Rcpp` or something using external program to like `sed` to fix the dates before using R to work on the data. But since I am rather new to R, I want to learn how it works in R. On a side matter, I have just found out that you are one of the authors of `RH2` package. My respect, and a pleasant surprise :-). Do you mind demonstrating how the work can be done with `RH2` instead of `SQLite`? Thanks again. – biocyberman Sep 08 '14 at 17:22
  • From what is written here: http://www.h2database.com/html/datatypes.html#date_type . I think that we can't avoid the steps of extracting `year`, `month`, and `day` and reformatting the dates. So the code would not look shorter. Is that true? – biocyberman Sep 08 '14 at 17:54
  • 1
    Have added an H2 solution at the end. – G. Grothendieck Sep 08 '14 at 18:36
  • That looks definitely cleaner. I tried with the real data and got error like: `execute JDBC update query failed in dbSendUpdate (Out of memory`. I ran the code in RStudio, and wanted to select all the columns. That probably hits the limit of RStudio. So, I will stay at `SQLite`-based solution for now. – biocyberman Sep 08 '14 at 20:28
  • Try this in R: `options(java.parameters="-Xmx1024m")` before loading any packages or use some other memory size. – G. Grothendieck Sep 08 '14 at 21:27