r data.table function wrapper around ad-hoc join (with aggregation in a chain)

Question

[data.table_1.9.6] background of the question is that I am trying to build olap-like query features in a star-schema-like data layout, i.e. a large fact table and several meta tables. I am building a function wrapper around data.table join followed by an aggregation in a chain as so:

# dummy data
dt1 = data.table(id = 1:5, x=letters[1:5], a=11:15, b=21:25)
dt2 = data.table(k=11:15, z=letters[11:15])

# standard data.table query with ad-hoc key -> works fine
dt1[dt2, c("z") := .(i.z), with = F, 
    on = c(a="k")][, .(m = sum(a, na.rm = T),
                   count = .N), by = c("z")]

# wrapper function with setkey -> works fine
agg_foo <- function(x, meta_tbl, x_key, meta_key, agg_var) { 
  setkeyv(x, x_key)
  setkeyv(meta_tbl, meta_key)
  x[meta_tbl, (agg_var) := get(agg_var)][,.(a_sum = sum(a, na.rm=T),
                                            count = .N), 
                                         by = c(agg_var)]
  x[, (agg_var) := .(NULL)]
  }

# call function (works fine)
agg_foo(x=dt1, meta_tbl=dt2, x_key="a", meta_key="k",agg_var="z")

# wrapper function with ad-hoc key -> does not work
agg_foo_ad_hoc <- function(x, meta_tbl, x_key, meta_key, agg_var) { 
  x[meta_tbl, (agg_var) := get(agg_var), 
    on = c(x_key = meta_key)][,.(a_sum = sum(a, na.rm=T),
                                 count = .N), by = c(agg_var)]
  x[, (agg_var) := .(NULL)]
  }

# call function (causes error)
agg_foo_ad_hoc(x=dt1, meta_tbl=dt2, x_key="a", meta_key="k",agg_var="z")

Error in forderv(x, by = rightcols) : 
  'by' value -2147483648 out of range [1,4]

my guess is that I have to provide the ad-hoc "on" parameter in a different way. I tried on = c(get(x_key) = meta_key) but then he is complaining about unexpected brackets. I could go with setkey version of the function that works but I wonder if this is efficient given that the function will work on different meta tables depending on which attribute for the aggregation is used and thus constantly re-setting the key. or is the setkey always to be preferred? actual fact table (x here) has > 30 mln rows.

You may want to check [data.cube](https://github.com/jangorecki/data.cube) package which defines OLAP cube R data type, based on set of data.tables. In the package there is `cube` class which is star schema modeled. In devel branch *data.cube-oop* there is a new `data.cube` class which is a mix of star and snowflake schema, described in [this SO question](https://stackoverflow.com/questions/35472639/star-schema-normalized-dimensions-denormalized-hierarchy-level-keys). It automatically sub-aggregate to available dimensions and keeps the data in cube normalized. — jangorecki, Jun 08 '16 at 15:49
@jangorecki: I know about your package. already looked into it in the past. thanks for the tip anyway! I would also gladly go through its github source if it wasn't blocked by corporate IT. but is there a data.table only implementation? — Triamus, Jun 08 '16 at 15:56
The actual upstream repo is on [gitlab.com/jangorecki/data.cube](https://gitlab.com/jangorecki/data.cube), so you may try there. Implementation is based on `data.table` and [big.data.table](https://github.com/jangorecki/big.data.table) for fact table (and only partially for `cube` class so far). The latter one allows to distribute fact table over multiple machines, removing the memory limits and allowing parallelized queries, capable to process OLAP big data efficiently. — jangorecki, Jun 08 '16 at 16:03

score 2 · Accepted Answer · answered Jun 08 '16 at 17:00

2

All you need to do is construct a vector with correct labels. Here's one way of doing that:

agg_foo_ad_hoc <- function(x, meta_tbl, x_key, meta_key, agg_var) { 
  x[meta_tbl, (agg_var) := get(agg_var), 
    on = setNames(meta_key, x_key)][,.(a_sum = sum(a, na.rm=T),
                                       count = .N), by = c(agg_var)]
  x[, (agg_var) := .(NULL)]
}

answered Jun 08 '16 at 17:00

eddi

49,088
6
104
155

do you have an opinion on function design, i.e. whether setkey is to be preferred over ad-hoc key? – Triamus Jun 08 '16 at 20:58
2

@Triam nowadays I'm using `on` almost exclusively. It's more flexible, and I like knowing what I'm doing within each operation, as opposed to having used `setkey` in some other part of the code and then having to remember that I'd done that. The only case where I'd use `setkey` now, is if I need to do multiple joins/lookups in a row on the same key. – eddi Jun 08 '16 at 21:46

r data.table function wrapper around ad-hoc join (with aggregation in a chain)

1 Answers1