
I have a data table with this schema:

date, key_a, key_b, key_c, key_d, value

I'd like to make a list of tables with this structure:

[1]
date, key_a, value

[2] 
date, key_b, value

[3]
date, key_c, value

I'd like to aggregate my `dt` into a list where each entry aggregates `dt` across a different one of the keys.

This is my code:

setDT(dt)
list_of_dts <- list()

list_of_dts[[1]] <-
    dt[, .(value = sum(value)), .(date, key_a)]

list_of_dts[[2]] <-
    dt[, .(value = sum(value)), .(date, key_b)]

(and so on)

Is there a more efficient way of solving this?

  • Cauder, you are certainly right to be concerned about performance on large tables. The speed and efficiency that `data.table` provides come from keeping the data in one place; splitting a frame takes that single object and breaks it into multiple objects, which usually involves copying data (something `data.table` does only as a last resort). Most of your data is copied *by definition* when you have `"date"` and `"value"` in all of the listified tables (a small demonstration follows). – r2evans Sep 10 '20 at 04:58
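
A quick way to see the copying r2evans describes (a minimal sketch on a hypothetical toy table, using data.table's `address` helper; `dt_demo` is made up for illustration):

library(data.table)
dt_demo <- data.table(date = 1:3, key_a = 1:3, value = 1:3)
address(dt_demo$date)       # memory address of the original column
s <- split(dt_demo, dt_demo$key_a)
address(s[[1]]$date)        # a different address: the data was copied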

3 Answers


Maybe get the data in long format and then summarise it:

library(data.table)
setDT(dt)
# value.name avoids a clash with the existing 'value' column
dt1 <- melt(dt, id.vars = c('date', 'value'), value.name = 'key_value')
dt1 <- dt1[, .(value = sum(value)), .(date, variable, key_value)]

Now, if you want a list of data.tables, you can use `split`:

split(dt1, dt1$variable)
– Ronak Shah
  • Does `split` take advantage of data.table's speed? – Cauder Sep 10 '20 at 04:54
  • @Cauder - there has been a specific `split.data.table` method for a while now, which I believe is optimised (see the sketch after these comments) - see https://stackoverflow.com/questions/14977997/split-data-table – thelatemail Sep 10 '20 at 04:57
  • Also consider whether you really want to split your data in the first place. What are you trying to achieve by doing that? – Ronak Shah Sep 10 '20 at 04:59
  • I have a single column that I want to separately aggregate across a bunch of other columns. The computation will be expensive because the table is 50 GB. – Cauder Sep 10 '20 at 14:02
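
Following thelatemail's pointer, the `split.data.table` method can also be invoked through its `by` argument (a minimal sketch, reusing `dt1` from the answer above; `variable` is the column created by `melt`):

split(dt1, by = "variable")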

We can use the tidyverse:

library(dplyr)
library(tidyr)
dt %>%
    pivot_longer(cols = starts_with('key'), values_to = 'value1') %>%
    group_by(date, name, value1) %>%
    summarise(value = sum(value), .groups = 'drop')
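
If you do want the list structure from the question, the summarised result can be split afterwards; a minimal sketch using dplyr's `group_split` on the same pipeline:

dt %>%
    pivot_longer(cols = starts_with('key'), values_to = 'value1') %>%
    group_by(date, name, value1) %>%
    summarise(value = sum(value), .groups = 'drop') %>%
    group_split(name)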

An option is to convert to `disk.frame` and do the group-by operations there:

library(disk.frame)
dt %>%
    pivot_longer(cols = starts_with('key'), values_to = 'value1') %>%
    as.disk.frame %>%
    group_by(date, name, value1) %>%
    summarise(value = sum(value)) %>%
    collect()
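
Note that disk.frame parallelises across background workers; a typical setup step (a sketch following the package documentation, the worker count is an assumption) is:

library(disk.frame)
# use several background workers for parallel chunk processing
setup_disk.frame(workers = 4)
# allow larger objects to be passed to the workers
options(future.globals.maxSize = Inf)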

When there are multiple .csv files, they can be read directly with `csv_to_disk.frame`:

df <- csv_to_disk.frame(file.path(tempdir(), "df.csv"),
  inmapfn = function(chunk) {
    # convert date_str to Date and store it as "date"
    chunk[, date := as.Date(date_str, "%Y-%m-%d")]
    chunk[, date_str := NULL]
    chunk[, new := col1 + 5]
  })
– akrun
  • I'm concerned about speed because my data.table has 800M rows. Can I do this in a way that takes advantage of data.table? – Cauder Sep 10 '20 at 04:39
  • @Cauder consider using [disk.frame](https://diskframe.com/articles/vs-dask-juliadb.html); it is fast – akrun Sep 10 '20 at 05:28

Try this data.table-native attempt:

library(data.table)
dt <- data.table(date = c(1,1,2), key_a = c(11,11,13), key_b = c(21,21,23),
                 key_c = c(31,31,33), key_d = c(41,41,43), value = c(51,51,53))
keynames <- grep("^key", colnames(dt), value = TRUE)
othnames <- setdiff(colnames(dt), keynames)
keynames
# [1] "key_a" "key_b" "key_c" "key_d"
othnames
# [1] "date"  "value"

The split, for raw data:

lapply(setNames(nm = keynames), function(kn) subset(dt, select = c(othnames, kn)))
# $key_a
#    date value key_a
# 1:    1    51    11
# 2:    1    51    11
# 3:    2    53    13
# $key_b
#    date value key_b
# 1:    1    51    21
# 2:    1    51    21
# 3:    2    53    23
# $key_c
#    date value key_c
# 1:    1    51    31
# 2:    1    51    31
# 3:    2    53    33
# $key_d
#    date value key_d
# 1:    1    51    41
# 2:    1    51    41
# 3:    2    53    43
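
Equivalently, the column selection can stay in data.table syntax rather than going through the `subset` generic (one option; the behaviour is the same):

lapply(setNames(nm = keynames), function(kn) dt[, c(othnames, kn), with = FALSE])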

Or for your aggregation issue:

lapply(setNames(nm = keynames), function(kn) dt[, .(value = sum(value)), by = c("date", kn)])
# $key_a
#    date key_a value
# 1:    1    11   102
# 2:    2    13    53
# $key_b
#    date key_b value
# 1:    1    21   102
# 2:    2    23    53
# $key_c
#    date key_c value
# 1:    1    31   102
# 2:    2    33    53
# $key_d
#    date key_d value
# 1:    1    41   102
# 2:    2    43    53
– r2evans