Converting SQL query into data.table in R for grouping by min

Question

I wondered if you could help. I am trying to replicate this MS Access SQL query in R using data.table. The query is:

SELECT Tbl_X.code, 
       Min(Tbl_X.DateTime) AS MinOfDateTime, 
       Max(Tbl_X.EndDate) AS MaxOfEndDate
FROM Tbl_X
GROUP BY Tbl_X.code
HAVING (((Min(Tbl_X.DateTime)) < DateAdd("d",1,[Forms]![Frm_ControlPanel]![CalcDateFromForm])))
ORDER BY Min(Tbl_X.DateTime);

The HAVING part is to filter out any observations on today's date.

My R code is:

library(data.table)
...

events <- as.data.table(event_data)[DateTime < max(DateTime),]
                                   [order(DateTime),
                                   .(Code, 
                                     Min_event_date = min(DateTime), 
                                     Max_validity_date = max(EndDate)),
                                   by = Code]

I basically want a a single observation for each code that has the first date it's seen and the last date grouped by the code based on its first observation as the code can appear multiple times for every time its activated.

so from:

Code  DateTime     EndDate
 A    2017-02-09   2017-04-09
 A    2017-04-09   2017-06-09
 A    2017-07-09   2017-09-09
 B    2017-03-04   2017-05-11
 B    2017-06-04   2017-08-13

To

Code  DateTime     EndDate
 A    2017-02-09   2017-09-09
 B    2017-03-04   2017-08-13

However, I am getting a slightly lower total figure that from the SQL. So I'm wondering if, firstly is what I am doing replicating the SQL and whether that is the most efficient way.

If you're getting a lower # rows but do have only one row per Code, then it must be the HAVING part is borked / mistranslated, right? Your code looks fine except you don't need `.(Code, ` since you already get it from `by=Code`. If you could post an example, it might help to pin it down. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 — Frank, Aug 16 '17 at 17:55
Thank you, I think Parfait answer has addressed it. I would of struggled with a reproducible as there is > 200k rows in the database :) — MrMonkeyBum, Aug 17 '17 at 13:18

score 3 · Accepted Answer · answered Aug 16 '17 at 19:12

In SQL, the HAVING clause usually is a filter using an aggregate (unlike the WHERE clause filter) and you do use an aggregate: Min(Tbl_X.DateTime). Additionally, you are filtering one day in advance of the Access' form control value which you say is today's date since you use DateAdd().

Hence consider running the data.table bracketing twice: first to aggregate then to filter on returned aggregate, Min_event_date.

events <- as.data.table(event_data)[order(DateTime),
                                    .(Min_event_date = min(DateTime), 
                                      Max_validity_date = max(EndDate)),
                                    by = Code][Min_event_date < Sys.Date() + 1,]

Likely, too this is the order of operations in SQL where the GROUP BY precedes HAVING when the database engine runs virtual tables in background before final resultset.

Thank you that is running better and I've realised there was a mistake in a follow-up query I was doing around the dates. I am matching now. — MrMonkeyBum, Aug 17 '17 at 13:14

Converting SQL query into data.table in R for grouping by min

1 Answers1