4

I am using sqldf library to return a data frame with distinct values and also only the max of the date column. The data frame looks like this

+------+----------+--------+-----------------+
| NAME |   val1   |  val2  |      DATE       |
+------+----------+--------+-----------------+
| A    |  23.7228 | 0.5829 | 11/19/2014 8:17 |
| A    |  23.7228 | 0.5829 | 11/12/2014 8:16 |
+------+----------+--------+-----------------+

When I try to run the below code to get the distinct values with max date

df <-  sqldf("SELECT DISTINCT NAME, val1, val2, MAX(DATE) FROM Table")

I get this as the output.

+------+----------+--------+-----------------+
| NAME |   val1   |  val2  | MAX(DATE)       |
+------+----------+--------+-----------------+
| A    |  23.7228 | 0.5829 | 1416406625      |
+------+----------+--------+-----------------+

Please let me know how do I convert the last column, which is an integer to get back my datetime format.

zx8754
  • 52,746
  • 12
  • 114
  • 209
Sharath
  • 2,225
  • 3
  • 24
  • 37

3 Answers3

2

Next time please provide your input in reproducible form. I have done it this time for you below. Also the SQL code in the question has an SQLite syntax error which I have fixed below.

The easiest way to get this right is to use the name DATE for the output column in which case sqldf will figure that its of the same type as the DATE input column. SQLite has no date and time types so there is no way for sqldf to know that what is being returned is a datetime when using sqldf with SQLite. sqldf uses some heuristics to guess such as the one just discussed.

library(sqldf)

Lines <- "NAME,val1,val2,DATE     
A,23.7228,0.5829,11/19/2014 8:17 
A,23.7228,0.5829,11/12/2014 8:16"

Table <- read.csv(text = Lines, as.is = TRUE)
Table$DATE <- as.POSIXct(Table$DATE, format = "%m/%d/%Y %H:%M")

sqldf("SELECT DISTINCT NAME, val1, val2, MAX(DATE) DATE FROM 'Table'")

giving:

   NAME    val1   val2                DATE
1     A 23.7228 0.5829 2014-11-19 08:17:00

If we used H2 with sqldf then we would not have these problems since H2 does support date and time types so sqldf does not have to guess. Also the syntax of your SQL query works as is in H2. Using the Table data.frame shown above:

library(RH2)
library(sqldf)

sqldf("SELECT DISTINCT NAME, val1, val2, MAX(DATE) DATE FROM Table")

gives:

   NAME    val1   val2           MAX(DATE)
1     A 23.7228 0.5829 2014-11-19 08:17:00
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • The RH2 library did the trick. Thanks for pointing that out. I am able to see the desired output and got it working. I will be more responsible in providing my input in a much more reproducible form from now. Is there a link that I can refer to for guidelines if any? – Sharath Mar 26 '15 at 21:20
  • Check out http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example . By the way, although this example seems to work in H2 I am not sure that H2 really allows you to take `MAX` without a `group by` clause in general. SQLite does but that is an extension that SQLite provides that is not necessarily available in other SQL database systems. If you are going to use H2 for this then add a group by clause to be safe. `select NAME, val1, val2, MAX(DATE) from Table group by NAME, val1, val2`. – G. Grothendieck Mar 26 '15 at 21:27
  • I am using the H2 for now. Yes. I always add the group by whenever I use an aggregate function. Thanks for the link. Very useful. – Sharath Mar 26 '15 at 21:31
0

Try:

> as.POSIXct(1416406625, origin = "1970-01-01", tz = "GMT")
[1] "2014-11-19 14:17:05 GMT"

You may need to change the timezone (tz) to get the right time.

szabad
  • 380
  • 3
  • 10
0

I prefer to use a little workaround to make sure that my dates stay dates when passing through sqldf.

This pseudo-code will convince sqldf to give your date back as a date after applying a function to it:

DATE(SOMEFUNCTION(date_field) * 86400, 'unixepoch')

or

DATETIME(SOMEFUNCTION(date_field) * 86400, 'unixepoch')

Basically, when you apply a function to a date, sqldf converts the date into a numeric representation of that date in a version of Unix time, but doesn't bother to convert it back. Also, frustratingly it gives you the Unix time in units (days) different from ones used by the DATE/DATETIME functions (seconds). As such, you have to covert them by multiplying the number sqldf gives you by the number of seconds in a day (86400) and then you have the right Unix time.

In your specific case (changing DATE to date_field to avoid confusion with the function name):

df <- sqldf("SELECT DISTINCT NAME, val1, val2, DATETIME(MAX(date_field) * 86400, 'unixepoch') FROM Table")

It makes for clunky SQL, but saves you some arguably more clunky R code. There are other benefits as well such as making the SQL more robust and reusable. Also, R has a bad habit of changing your datatypes which can can catch up to you quickly and it is a good idea to nip it in the bud any chance you get.

imfm
  • 119
  • 7