1

I want to calculate an aggregated value from a data.table and use it to replace the original values. I tried the following approach

library(data.table)
(DT <- data.table(grp  = rep(LETTERS[1:3], 3), 
                  val  = 1:9, 
                  val2 = rep(letters[24:26], each = 3)))
#    grp val val2
# 1:   A   1    x
# 2:   B   2    x
# 3:   C   3    x
# 4:   A   4    y
# 5:   B   5    y
# 6:   C   6    y
# 7:   A   7    z
# 8:   B   8    z
# 9:   C   9    z

(agg <- DT[, .SD[which.min(val)], grp])
#    grp val val2
# 1:   A   1    x
# 2:   B   2    x
# 3:   C   3    x

DT[, val3 := "New Value"]
agg[DT, on = "grp"][, .SD, .SDcols = !patterns("^i\\.")]
#    grp val val2      val3
# 1:   A   1    x New Value
# 2:   B   2    x New Value
# 3:   C   3    x New Value
# 4:   A   1    x New Value
# 5:   B   2    x New Value
# 6:   C   3    x New Value
# 7:   A   1    x New Value
# 8:   B   2    x New Value
# 9:   C   3    x New Value

While this approach works in this toy example I have a couple of concerns:

  1. I would like to avoid chaining in the first place to avoid massive copying of unused columns (my real dataset is rather big 2e5 x 200) Quote from data.table FAQ 1.10:

    We therefore strongly encourage X[Y, j] instead of X[Y]

    Putting the .SD part into the j slot in the first subset, won't work

    agg[DT, .SD, on = "grp", .SDcols = !patterns("^i\\.")]
    # Error in do_patterns(colsub, names_x) : Pattern not found: [^i\.]
    
  2. The patterns approach feels a bit hackish and relies on the assumption that data.table will always prefix the columns from Y with i.. If the package maintainers would change that for whatever reason my code would break.
  3. I am still learning data.table and I want to learn what is the most "idiomatic" way of solving this in data.table

How can I solve this problem while avoiding to make unnecessary copies to save resources?

Note. For what it matters: I do not want to change the values in DT by reference.

thothal
  • 16,690
  • 3
  • 36
  • 71
  • maybe something like `agg[DT, on=.(grp), mget(paste0("i.", names(DT)))]`. see these seminal post: https://stackoverflow.com/questions/29321218/complex-data-table-subset-and-vectorised-maniulation/29333821#29333821 and https://stackoverflow.com/questions/30468455/dynamically-build-call-for-lookup-multiple-columns/30469832#30469832 – chinsoon12 Apr 24 '20 at 23:36

2 Answers2

2

Depending on your data, this could perhaps be more efficient and names the columns explicitly:

library(data.table)
DT <- data.table(grp  = rep(LETTERS[1:3], 3), 
                  val  = 1:9, 
                  val2 = rep(letters[24:26], each = 3))
agg <- DT[, .SD[which.min(val)], grp]
DT[, val3 := "New Value"]

repcols <- setdiff(colnames(agg), "grp")
DT[, (repcols) := agg[DT, .SD, on = .(grp), .SDcols=repcols]][]
#>    grp val val2      val3
#> 1:   A   1    x New Value
#> 2:   B   2    x New Value
#> 3:   C   3    x New Value
#> 4:   A   1    x New Value
#> 5:   B   2    x New Value
#> 6:   C   3    x New Value
#> 7:   A   1    x New Value
#> 8:   B   2    x New Value
#> 9:   C   3    x New Value

Edit: Based on the follow-up question, changing by reference could e.g. be directly achieved like this:

library(data.table)
DT <- data.table(grp  = rep(LETTERS[1:3], 3), 
                 val  = 1:9, 
                 val2 = rep(letters[24:26], each = 3))

keepcols <- setdiff(colnames(DT), "grp") 
DT[, val3 := letters[1:9]]
# if you want to keep all of val3; otherwise switch the previous two lines

DT[, (keepcols) := .SD[which.min(val)], by=.(grp), .SDcols=keepcols][]
#>    grp val val2 val3
#> 1:   A   1    x    a
#> 2:   B   2    x    b
#> 3:   C   3    x    c
#> 4:   A   1    x    d
#> 5:   B   2    x    e
#> 6:   C   3    x    f
#> 7:   A   1    x    g
#> 8:   B   2    x    h
#> 9:   C   3    x    i

Created on 2020-04-24 by the reprex package (v0.3.0)

user12728748
  • 8,106
  • 2
  • 9
  • 14
  • Nice, but as said I wanted to avoid changing by reference. – thothal Apr 24 '20 at 14:32
  • Sorry - I thought this applied to explicitly joining the previously created agg data.table instead of replacing the values straight away (otherwise it would be way easier not to create that in the first place). – user12728748 Apr 24 '20 at 14:39
  • Could you elaborate on "otherwise it would be way easier not to create that in the first place)"? – thothal Apr 24 '20 at 15:57
  • Makes sense now to skip the `agg` altogether. The more I think of it, maybe I should indeed change by reference. I guess I should started to get used to that idea (difficult to change habits). – thothal Apr 27 '20 at 07:14
1

Assigning by reference would be pretty idiomatic. Why do you indicate you are not interested in this method?

DT[agg, on = "grp", val3 := val3]

Or, if agg is truly an aggregate of DT, then it could have been:

DT[, val4 := min(val), by = grp]

      grp   val   val2      val3  val4
   <char> <int> <char>    <char> <int>
1:      A     1      x New Value     1
2:      B     2      x New Value     2
3:      C     3      x New Value     3
4:      A     4      y New Value     1
5:      B     5      y New Value     2
6:      C     6      y New Value     3
7:      A     7      z New Value     1
8:      B     8      z New Value     2
9:      C     9      z New Value     3
Cole
  • 11,130
  • 1
  • 9
  • 24
  • Thanks for your answer. But if I am getting your code right we are adding new columns to the original data table by reference. I want to have a data table where the original values are replaced by the respective aggregated values, like in the answer below, but without using by ref semantics. – thothal Apr 27 '20 at 07:09
  • If you do not want to modify the original, you can do ```copy(DT)[, newval := 1]```. This is for illustration, you could change to whatever you needed in ```DT[...]``` – Cole Apr 27 '20 at 10:39