Replace values in data.table by an aggregated value

Question

I want to calculate an aggregated value from a data.table and use it to replace the original values. I tried the following approach

library(data.table)
(DT <- data.table(grp  = rep(LETTERS[1:3], 3), 
                  val  = 1:9, 
                  val2 = rep(letters[24:26], each = 3)))
#    grp val val2
# 1:   A   1    x
# 2:   B   2    x
# 3:   C   3    x
# 4:   A   4    y
# 5:   B   5    y
# 6:   C   6    y
# 7:   A   7    z
# 8:   B   8    z
# 9:   C   9    z

(agg <- DT[, .SD[which.min(val)], grp])
#    grp val val2
# 1:   A   1    x
# 2:   B   2    x
# 3:   C   3    x

DT[, val3 := "New Value"]
agg[DT, on = "grp"][, .SD, .SDcols = !patterns("^i\\.")]
#    grp val val2      val3
# 1:   A   1    x New Value
# 2:   B   2    x New Value
# 3:   C   3    x New Value
# 4:   A   1    x New Value
# 5:   B   2    x New Value
# 6:   C   3    x New Value
# 7:   A   1    x New Value
# 8:   B   2    x New Value
# 9:   C   3    x New Value

While this approach works in this toy example I have a couple of concerns:

I would like to avoid chaining in the first place to avoid massive copying of unused columns (my real dataset is rather big 2e5 x 200) Quote from data.table FAQ 1.10:

We therefore strongly encourage X[Y, j] instead of X[Y]

Putting the .SD part into the j slot in the first subset, won't work
```
agg[DT, .SD, on = "grp", .SDcols = !patterns("^i\\.")]
# Error in do_patterns(colsub, names_x) : Pattern not found: [^i\.]
```
The patterns approach feels a bit hackish and relies on the assumption that data.table will always prefix the columns from Y with i.. If the package maintainers would change that for whatever reason my code would break.
I am still learning data.table and I want to learn what is the most "idiomatic" way of solving this in data.table

How can I solve this problem while avoiding to make unnecessary copies to save resources?

Note. For what it matters: I do not want to change the values in DT by reference.

maybe something like `agg[DT, on=.(grp), mget(paste0("i.", names(DT)))]`. see these seminal post: https://stackoverflow.com/questions/29321218/complex-data-table-subset-and-vectorised-maniulation/29333821#29333821 and https://stackoverflow.com/questions/30468455/dynamically-build-call-for-lookup-multiple-columns/30469832#30469832 — chinsoon12, Apr 24 '20 at 23:36

user12728748 · Accepted Answer · 2020-04-24T16:39:23.230

Depending on your data, this could perhaps be more efficient and names the columns explicitly:

library(data.table)
DT <- data.table(grp  = rep(LETTERS[1:3], 3), 
                  val  = 1:9, 
                  val2 = rep(letters[24:26], each = 3))
agg <- DT[, .SD[which.min(val)], grp]
DT[, val3 := "New Value"]

repcols <- setdiff(colnames(agg), "grp")
DT[, (repcols) := agg[DT, .SD, on = .(grp), .SDcols=repcols]][]
#>    grp val val2      val3
#> 1:   A   1    x New Value
#> 2:   B   2    x New Value
#> 3:   C   3    x New Value
#> 4:   A   1    x New Value
#> 5:   B   2    x New Value
#> 6:   C   3    x New Value
#> 7:   A   1    x New Value
#> 8:   B   2    x New Value
#> 9:   C   3    x New Value

Edit: Based on the follow-up question, changing by reference could e.g. be directly achieved like this:

library(data.table)
DT <- data.table(grp  = rep(LETTERS[1:3], 3), 
                 val  = 1:9, 
                 val2 = rep(letters[24:26], each = 3))

keepcols <- setdiff(colnames(DT), "grp") 
DT[, val3 := letters[1:9]]
# if you want to keep all of val3; otherwise switch the previous two lines

DT[, (keepcols) := .SD[which.min(val)], by=.(grp), .SDcols=keepcols][]
#>    grp val val2 val3
#> 1:   A   1    x    a
#> 2:   B   2    x    b
#> 3:   C   3    x    c
#> 4:   A   1    x    d
#> 5:   B   2    x    e
#> 6:   C   3    x    f
#> 7:   A   1    x    g
#> 8:   B   2    x    h
#> 9:   C   3    x    i

^{Created on 2020-04-24 by the reprex package (v0.3.0)}

Sorry - I thought this applied to explicitly joining the previously created agg data.table instead of replacing the values straight away (otherwise it would be way easier not to create that in the first place). — user12728748, Apr 24 '20 at 14:39
Could you elaborate on "otherwise it would be way easier not to create that in the first place)"? — thothal, Apr 24 '20 at 15:57
Makes sense now to skip the `agg` altogether. The more I think of it, maybe I should indeed change by reference. I guess I should started to get used to that idea (difficult to change habits). — thothal, Apr 27 '20 at 07:14

score 1 · Answer 2 · answered Apr 25 '20 at 11:44

1

Assigning by reference would be pretty idiomatic. Why do you indicate you are not interested in this method?

DT[agg, on = "grp", val3 := val3]

Or, if agg is truly an aggregate of DT, then it could have been:

DT[, val4 := min(val), by = grp]

      grp   val   val2      val3  val4
   <char> <int> <char>    <char> <int>
1:      A     1      x New Value     1
2:      B     2      x New Value     2
3:      C     3      x New Value     3
4:      A     4      y New Value     1
5:      B     5      y New Value     2
6:      C     6      y New Value     3
7:      A     7      z New Value     1
8:      B     8      z New Value     2
9:      C     9      z New Value     3

answered Apr 25 '20 at 11:44

Cole

11,130
1
9
24

Thanks for your answer. But if I am getting your code right we are adding new columns to the original data table by reference. I want to have a data table where the original values are replaced by the respective aggregated values, like in the answer below, but without using by ref semantics. – thothal Apr 27 '20 at 07:09
If you do not want to modify the original, you can do ```copy(DT)[, newval := 1]```. This is for illustration, you could change to whatever you needed in ```DT[...]``` – Cole Apr 27 '20 at 10:39

Replace values in data.table by an aggregated value

2 Answers2