I've been trying (and searching for) a way to a sum over ranges of dates for every ID. I've looked here and everywhere for an SQL syntax to do this, since it seems to me this should be a fairly common problem, but I've not found something that does exactly what I want, for example this.
I have a table with the following format:
ID | start_date | end_date
---|---|---
aaa|2016-10-26 07:00| 2016-12-15 04:00
aaa|2016-10-26 08:00| 2016-12-15 05:00
bbb|2016-10-26 07:00| 2016-11-15 03:00
ccc|2016-10-26 07:00| 2016-10-30 04:00
This table could be said to have as PK the ID
and the start_date
. (This table has ~1.5M records). We can call every row in this table a "period".
I also have a calendar that marks all the "working timeslots" in a table like the following: (This table has ~800 records)
|end_date|
|-|
|2016-10-26 07:00|
|2016-10-26 08:00|
|2016-10-26 09:00|
|...|
|2016-12-26 09:00|
Finally, I have a table with consumption data, like the following: (This table has ~2.3M records).
|ID|start_date|consumption|
|-|-|-|
|aaa|2016-10-27 07:00| 1|
|aaa|2016-10-27 08:00| 5|
|aaa|2016-10-27 09:00| 3|
|bbb|2016-10-27 07:00| 3|
What I want to get as a result is the following table:
|ID|start_date|end_date|consumption_sum|
|-|-|-|-|
|aaa|2016-10-26 07:00| 2016-12-15 04:00|14|
|bbb|2016-10-26 07:00| 2016-11-15 03:00|32|
|ccc|2016-10-26 07:00| 2016-10-30 04:00|17|
where consumption_sum = sum(consumption for each ID of all the consumption data on working days between start_date and end_date). This means: I want the sum of the consumption inside each period for each ID.
Right now, I'm making some kind of join between the first two tables, to get a big (too big) table with the expanded rows of every day of every "period". Like the following:
|ID|start_date|end_date_s|
|-|-|-|
|aaa|2016-10-26 07:00| 2016-10-26 07:00|
|aaa|2016-10-26 07:00| 2016-10-26 08:00|
|aaa|2016-10-26 07:00| 2016-10-26 09:00|
|aaa|2016-10-26 07:00| 2016-10-26 10:00|
|aaa|2016-10-26 07:00| ...|
|aaa|2016-10-26 07:00| 2016-12-15 04:00|
|bbb|2016-10-26 07:00| 2016-10-26 07:00|
|bbb|2016-10-26 07:00| 2016-10-26 08:00|
|bbb|2016-10-26 07:00| 2016-10-26 09:00|
|bbb|2016-10-26 07:00| ...|
|bbb|2016-10-26 07:00| 2016-11-15 03:00|
|ccc|2016-10-26 07:00| 2016-10-26 07:00|
|ccc|2016-10-26 07:00| ...|
then, joining this table with the third one to get the consumption of each day end_date_s
.
Finally, I sum by (ID, start_date) to get the desired table.
I'm doing this in R via the data.table function named foverlaps
which is the best I've found so far. This works correctly.
Sadly, the size of the data makes my pc run out of memory during the join of the first two tables.
I was wondering (I'm pretty sure it's possible) if it could be done better (via R or MySQL, I don't care).
The R code I'm currently using is the following (the example data is wrong, but at least you can see the code):
library(data.table)
library(magrittr)
stocks_periodo <-
structure(list(CODIGO_REFERENCIA = c("5293cb5478d6d400f0f555d531f2d63b",
"0fe0b44806573de5bde3c200455f5f03", "eb57daacff2abadf0f4551386f3c6678",
"2ead409e514f379fec7c94504f79206b", "cd0f1f709ed6631aeaf00881fc43ccad"
), PERIODO_INI = structure(c(1477512000, 1477512000, 1477512000,
1477512000, 1477512000), class = c("POSIXct", "POSIXt"), tzone = "Europe/Paris"),
PERIODO_FIN = structure(c(1477533600, 1477533600, 1477533600,
1477533600, 1477533600), class = c("POSIXct", "POSIXt"), tzone = "Europe/Paris")), class = "data.frame", .Names = c("CODIGO_REFERENCIA",
"PERIODO_INI", "PERIODO_FIN"), row.names = c(NA, -5L)) %>% data.table(key="PERIODO_INI,PERIODO_FIN")
calendario_n <-
structure(list(PERIODO_INI = structure(c(1477512000, 1477515600,
1477519200, 1477522800, 1477526400), class = c("POSIXct", "POSIXt"
), tzone = "Europe/Paris"), PERIODO_FIN = structure(c(1477512000,
1477515600, 1477519200, 1477522800, 1477526400), class = c("POSIXct",
"POSIXt"), tzone = "Europe/Paris")), .Names = c("PERIODO_INI",
"PERIODO_FIN"), row.names = c(NA, 5L), class = "data.frame") %>% data.table(key="PERIODO_INI,PERIODO_FIN")
consumos <-
structure(list(PERIODO = structure(c(1478034000, 1478037600,
1478041200, 1478044800, 1478048400), class = c("POSIXct", "POSIXt"
), tzone = ""), CODIGO_REFERENCIA = c("f3bcfd70cc0c3434d96278c0cfee1df4",
"f3bcfd70cc0c3434d96278c0cfee1df4", "f3bcfd70cc0c3434d96278c0cfee1df4",
"f3bcfd70cc0c3434d96278c0cfee1df4", "f3bcfd70cc0c3434d96278c0cfee1df4"
), DIARIO_CONSUMOS = c(8L, 8L, 8L, 8L, 8L)), class = "data.frame", .Names = c("PERIODO",
"CODIGO_REFERENCIA", "DIARIO_CONSUMOS"), row.names = c(NA, -5L
)) %>% data.table(key="CODIGO_REFERENCIA,PERIODO")
consumos_futuros<-
foverlaps(calendario_n, stocks_periodo, nomatch=0L) %>%
select(-i.PERIODO_INI, -PERIODO_FIN) %>%
rename(PERIODO_FIN= i.PERIODO_FIN) %>%
data.table(key="CODIGO_REFERENCIA,PERIODO_INI,PERIODO_FIN")
stocks_periodo %<>%
data.table(key=c("CODIGO_REFERENCIA", "PERIODO_INI", "PERIODO_FIN"))
consumos_futuros <- consumos_futuros[!stocks_periodo]
consumos_futuros %<>%
rename(PERIODO= PERIODO_FIN) %>%
data.table(key="CODIGO_REFERENCIA,PERIODO") %>%
merge(consumos) %>%
group_by(CODIGO_REFERENCIA, PERIODO_INI) %>%
summarize(CONSUMO_TOTAL= sum(DIARIO_CONSUMOS)) %>%
data.table
Thanks in advance (and sorry if this was indeed answered somewhere else).