Summarize data efficiently over several ranges of dates using MySQL and/ or R

Question

I've been trying (and searching for) a way to a sum over ranges of dates for every ID. I've looked here and everywhere for an SQL syntax to do this, since it seems to me this should be a fairly common problem, but I've not found something that does exactly what I want, for example this.

I have a table with the following format:

ID | start_date | end_date
---|---|---
aaa|2016-10-26 07:00| 2016-12-15 04:00
aaa|2016-10-26 08:00| 2016-12-15 05:00
bbb|2016-10-26 07:00| 2016-11-15 03:00
ccc|2016-10-26 07:00| 2016-10-30 04:00

This table could be said to have as PK the ID and the start_date. (This table has ~1.5M records). We can call every row in this table a "period".

I also have a calendar that marks all the "working timeslots" in a table like the following: (This table has ~800 records)

|end_date|
|-|
|2016-10-26 07:00|
|2016-10-26 08:00|
|2016-10-26 09:00|
|...|
|2016-12-26 09:00|

Finally, I have a table with consumption data, like the following: (This table has ~2.3M records).

|ID|start_date|consumption|
|-|-|-|
|aaa|2016-10-27 07:00| 1|
|aaa|2016-10-27 08:00| 5|
|aaa|2016-10-27 09:00| 3|
|bbb|2016-10-27 07:00| 3|

What I want to get as a result is the following table:

|ID|start_date|end_date|consumption_sum|
|-|-|-|-|
|aaa|2016-10-26 07:00| 2016-12-15 04:00|14|
|bbb|2016-10-26 07:00| 2016-11-15 03:00|32|
|ccc|2016-10-26 07:00| 2016-10-30 04:00|17|

where consumption_sum = sum(consumption for each ID of all the consumption data on working days between start_date and end_date). This means: I want the sum of the consumption inside each period for each ID.

Right now, I'm making some kind of join between the first two tables, to get a big (too big) table with the expanded rows of every day of every "period". Like the following:

|ID|start_date|end_date_s|
|-|-|-|
|aaa|2016-10-26 07:00| 2016-10-26 07:00|
|aaa|2016-10-26 07:00| 2016-10-26 08:00|
|aaa|2016-10-26 07:00| 2016-10-26 09:00|
|aaa|2016-10-26 07:00| 2016-10-26 10:00|
|aaa|2016-10-26 07:00| ...|
|aaa|2016-10-26 07:00| 2016-12-15 04:00|
|bbb|2016-10-26 07:00| 2016-10-26 07:00|
|bbb|2016-10-26 07:00| 2016-10-26 08:00|
|bbb|2016-10-26 07:00| 2016-10-26 09:00|
|bbb|2016-10-26 07:00| ...|
|bbb|2016-10-26 07:00| 2016-11-15 03:00|
|ccc|2016-10-26 07:00| 2016-10-26 07:00|
|ccc|2016-10-26 07:00| ...|

then, joining this table with the third one to get the consumption of each day end_date_s.

Finally, I sum by (ID, start_date) to get the desired table.

I'm doing this in R via the data.table function named foverlaps which is the best I've found so far. This works correctly.

Sadly, the size of the data makes my pc run out of memory during the join of the first two tables.

I was wondering (I'm pretty sure it's possible) if it could be done better (via R or MySQL, I don't care).

The R code I'm currently using is the following (the example data is wrong, but at least you can see the code):

library(data.table)
library(magrittr)

  stocks_periodo <-
    structure(list(CODIGO_REFERENCIA = c("5293cb5478d6d400f0f555d531f2d63b", 
    "0fe0b44806573de5bde3c200455f5f03", "eb57daacff2abadf0f4551386f3c6678", 
    "2ead409e514f379fec7c94504f79206b", "cd0f1f709ed6631aeaf00881fc43ccad"
    ), PERIODO_INI = structure(c(1477512000, 1477512000, 1477512000, 
    1477512000, 1477512000), class = c("POSIXct", "POSIXt"), tzone = "Europe/Paris"), 
        PERIODO_FIN = structure(c(1477533600, 1477533600, 1477533600, 
        1477533600, 1477533600), class = c("POSIXct", "POSIXt"), tzone = "Europe/Paris")), class = "data.frame", .Names = c("CODIGO_REFERENCIA", 
    "PERIODO_INI", "PERIODO_FIN"), row.names = c(NA, -5L)) %>% data.table(key="PERIODO_INI,PERIODO_FIN")


  calendario_n <- 
    structure(list(PERIODO_INI = structure(c(1477512000, 1477515600, 
    1477519200, 1477522800, 1477526400), class = c("POSIXct", "POSIXt"
    ), tzone = "Europe/Paris"), PERIODO_FIN = structure(c(1477512000, 
    1477515600, 1477519200, 1477522800, 1477526400), class = c("POSIXct", 
    "POSIXt"), tzone = "Europe/Paris")), .Names = c("PERIODO_INI", 
    "PERIODO_FIN"), row.names = c(NA, 5L), class = "data.frame") %>% data.table(key="PERIODO_INI,PERIODO_FIN")

  consumos <- 
    structure(list(PERIODO = structure(c(1478034000, 1478037600, 
    1478041200, 1478044800, 1478048400), class = c("POSIXct", "POSIXt"
    ), tzone = ""), CODIGO_REFERENCIA = c("f3bcfd70cc0c3434d96278c0cfee1df4", 
    "f3bcfd70cc0c3434d96278c0cfee1df4", "f3bcfd70cc0c3434d96278c0cfee1df4", 
    "f3bcfd70cc0c3434d96278c0cfee1df4", "f3bcfd70cc0c3434d96278c0cfee1df4"
    ), DIARIO_CONSUMOS = c(8L, 8L, 8L, 8L, 8L)), class = "data.frame", .Names = c("PERIODO", 
    "CODIGO_REFERENCIA", "DIARIO_CONSUMOS"), row.names = c(NA, -5L
    )) %>% data.table(key="CODIGO_REFERENCIA,PERIODO")

  consumos_futuros<-
    foverlaps(calendario_n, stocks_periodo, nomatch=0L) %>%
    select(-i.PERIODO_INI, -PERIODO_FIN) %>% 
    rename(PERIODO_FIN= i.PERIODO_FIN) %>% 
    data.table(key="CODIGO_REFERENCIA,PERIODO_INI,PERIODO_FIN")

  stocks_periodo %<>% 
    data.table(key=c("CODIGO_REFERENCIA", "PERIODO_INI", "PERIODO_FIN"))

  consumos_futuros <- consumos_futuros[!stocks_periodo]

  consumos_futuros %<>% 
    rename(PERIODO= PERIODO_FIN) %>% 
    data.table(key="CODIGO_REFERENCIA,PERIODO") %>% 
    merge(consumos) %>% 
    group_by(CODIGO_REFERENCIA, PERIODO_INI) %>% 
    summarize(CONSUMO_TOTAL= sum(DIARIO_CONSUMOS)) %>% 
    data.table

Thanks in advance (and sorry if this was indeed answered somewhere else).

@JasonAizkalns Thanks, yea I just figured that out. OP specified dplyr not magrittr and need to edit his question to include library statements. — Hack-R, Oct 31 '16 at 18:15
Have you read [this post](http://stackoverflow.com/q/24480031/2572423)? — JasonAizkalns, Oct 31 '16 at 18:18
@JasonAizkalns Thanks. Yes: I think I had reached that post also. I think I'm using the same function they describe there. I'm looking for an alternative way to to that (or a way to use the same function without the memory running out). — pchtsp, Oct 31 '16 at 18:27
Would it be possible to separate the data, for example by days or number of days? This would probably at least make it possible to get results. — hannes101, Nov 02 '16 at 09:52
@eddi I will test with Arun's second answer when I get some time and report. Altough it seemed to me it was just a more convenient way to do the same as ```foverlaps```. A more general doubt I had was if it made sense to do this type of semi-complex wranginlig inside MySQL instead of doing it directly in R: I'd think that there's exist some more intuitive and efficient sql formulation of what I want. @hannes101 I guess I could do it in a loop in blocks of dates but then it gets less intuitive and harder to mantain. — pchtsp, Nov 02 '16 at 17:36

score 1 · Accepted Answer · answered Nov 08 '16 at 16:59

Not sure if this really solves the "general" situation, but at least it solved mine.

As the first table shows, there were a lot of rolling periods for the same ID. These were constructed by summing a fix quantity of hours (based on the ID) to set of consecutive start_date's to generate a set of end_date's.

Not sure why I did not think of this out sooner but the zoo package has rollsumand rollapply functions. This do exactly what I wanted: a rolling sum of a given size.

So instead of summing arbitrary periods defined by a start_date and end_date, I summed a number of rows given by a size parameter dependant on the ID.

I only had to be careful in having a row for every working timeslot, regardless of wether it had consumption or not.

As an example of what I ended doing I leave the original code edited, I added a new table referencias that includes the amount of working timeslots to sum consumption (STOCK_HORAS).

library(data.table)
library(magrittr)
library(dplyr)
library(zoo)
library(tidyr)

stocks_periodo <-
  structure(list(CODIGO_REFERENCIA = c("5293cb5478d6d400f0f555d531f2d63b", 
                                       "0fe0b44806573de5bde3c200455f5f03", "eb57daacff2abadf0f4551386f3c6678", 
                                       "2ead409e514f379fec7c94504f79206b", "cd0f1f709ed6631aeaf00881fc43ccad"
  ), PERIODO_INI = structure(c(1477512000, 1477512000, 1477512000, 
                               1477512000, 1477512000), class = c("POSIXct", "POSIXt"), tzone = "Europe/Paris"), 
  PERIODO_FIN = structure(c(1477533600, 1477533600, 1477533600, 
                            1477533600, 1477533600), class = c("POSIXct", "POSIXt"), tzone = "Europe/Paris")), class = "data.frame", .Names = c("CODIGO_REFERENCIA", 
                                                                                                                                                "PERIODO_INI", "PERIODO_FIN"), row.names = c(NA, -5L)) %>% data.table(key="PERIODO_INI,PERIODO_FIN")


calendario_n <- 
  structure(list(PERIODO_INI = structure(c(1477512000, 1477515600, 
                                           1477519200, 1477522800, 1477526400), class = c("POSIXct", "POSIXt"
                                           ), tzone = "Europe/Paris"), PERIODO_FIN = structure(c(1477512000, 
                                                                                                 1477515600, 1477519200, 1477522800, 1477526400), class = c("POSIXct", 
                                                                                                                                                            "POSIXt"), tzone = "Europe/Paris")), .Names = c("PERIODO_INI", 
                                                                                                                                                                                                            "PERIODO_FIN"), row.names = c(NA, 5L), class = "data.frame") %>% data.table(key="PERIODO_INI,PERIODO_FIN")

consumos <- 
  structure(list(PERIODO = structure(c(1478034000, 1478037600, 
                                       1478041200, 1478044800, 1478048400), class = c("POSIXct", "POSIXt"
                                       ), tzone = ""), CODIGO_REFERENCIA = c("f3bcfd70cc0c3434d96278c0cfee1df4", 
                                                                             "f3bcfd70cc0c3434d96278c0cfee1df4", "f3bcfd70cc0c3434d96278c0cfee1df4", 
                                                                             "f3bcfd70cc0c3434d96278c0cfee1df4", "f3bcfd70cc0c3434d96278c0cfee1df4"
                                       ), DIARIO_CONSUMOS = c(8L, 8L, 8L, 8L, 8L)), class = "data.frame", .Names = c("PERIODO", 
                                                                                                                     "CODIGO_REFERENCIA", "DIARIO_CONSUMOS"), row.names = c(NA, -5L
                                                                                                                     )) %>% data.table(key="CODIGO_REFERENCIA,PERIODO")
referencias <- structure(list(CODIGO_REFERENCIA = c("3bed628bf8f6242c28d88200faa7e869", 
                                                    "cb3b6727071ec6659c712c3ec99c873a", "84bf5c06deaa2e42ae7edf6055b490db", 
                                                    "bb10cc5b0c3e127bd2073336365bf0e5", "b8b71160125f95a104e24878ff651e9c", 
                                                    "564169fbe71a04b31bb8e141be3fac66"), STOCK_HORAS = c(14, 14, 
                                                                                                         14, 18, 18, 14)), .Names = c("CODIGO_REFERENCIA", "STOCK_HORAS"
                                                                                                         ), class = c("data.table", "data.frame"), row.names = c(NA, -6L
                                                                                                         )) %>% data.table

calendario_refs <- 
  CJ(PERIODO= calendario_n %>% select(PERIODO=PERIODO_INI) %>% first,
     CODIGO_REFERENCIA= referencias %>% select(CODIGO_REFERENCIA) %>% first)

# por alguna razón rollsum va rápido pero no tiene partial =TRUE. Esto hace que
# valga la pena partir los cálculos en 2 y luego juntarlos.

rollsum_1 <-
  calendario_refs %>% 
  merge(consumos, by=c("CODIGO_REFERENCIA", "PERIODO"), all.x=TRUE) %>% 
  merge(referencias, by="CODIGO_REFERENCIA") %>% 
  replace_na(list(DIARIO_CONSUMOS=0)) %>% 
  select(CODIGO_REFERENCIA, PERIODO, STOCK_HORAS, DIARIO_CONSUMOS) %>% 
  arrange(CODIGO_REFERENCIA, PERIODO) %>% 
  group_by(CODIGO_REFERENCIA) %>%
  # si tenemos menos periodos que la suma móvil: reajustamos a lo máximo
  mutate(STOCK_HORAS= pmin(STOCK_HORAS, n())) %>%
  mutate(CONSUMO_TOTAL= rollsum(DIARIO_CONSUMOS, first(STOCK_HORAS), align = "left", fill=NA)) %>%
  data.table

Summarize data efficiently over several ranges of dates using MySQL and/ or R

1 Answers1