0

I'm trying to replicate the aggregate() base function with data.table syntax in this particular scenario:

# make it reproducible
set.seed(16)

# create data.table
DT <- data.table(source = sample(letters, 100, replace = TRUE), target = sample(LETTERS, 100, replace = TRUE))
#     source target
#  1:      j      J
#  2:      d      K
#  3:      w      L
#  4:      g      J
#  ...

# aggregate using base function
aggregate(list(target = DT$target), by = list(source = DT$source), FUN = function(x) paste(x, sep = ", "))
#   source              target
#1       a          L, W, S, W
#2       b V, H, R, J, G, W, N
#3       c          Y, C, I, K
#4       d          K, A, P, V
# ...

I tried a couple of things using the data.table syntax but I didn't get it to work:

DT[, .(target = paste(target, sep = ", ")), by = source]
#     source target
#  1:      r      P
#  2:      r      I
#  3:      r      Y
#  4:      r      G
#  ...

DT[, target := paste(target, sep = ", "), by = source]
#     source target
#  1:      r      P
#  2:      g      C
#  3:      l      U
#  4:      f      J
#  ...

What's the right way to do this?

Bonus points: remove duplicate LETTERS in output (i.e.: row 1 should be L, W, S, not L, W, S, W)

Thanks!

enricoferrero
  • 2,249
  • 1
  • 23
  • 28

1 Answers1

1

If we need to get a single string of all the elements in 'target' for each 'source', use collapse argument in paste. This can be written more compactly using toString (which is paste(..., collapse=", "))

DT[, .(target = toString(target)), by = source]

Instead of pasteing into a string, we can also have a list column

DT[, .(target = list(target)), by = source]

which will be similar to the aggregate output in OP's post (though the intention seems to be different with paste)

Update

If we need only unique elements, use the unique

DT[, .(target = toString(unique(target))), by = source]

DT[, .(target = list(unique(target))), by = source]

Also, if we need to sort, wrap it with sort

DT[, .(target = toString(sort(unique(target)))), by = source]

In the OP's aggregate code, the sep is not collapsing the string into a single one, instead what we get is a list column

str(aggregate(list(target = DT$target), by = list(source = DT$source), 
         FUN = function(x) paste(x, sep = ", ")))
#'data.frame':   25 obs. of  2 variables:
# $ source: chr  "b" "c" "d" "e" ...
# $ target:List of 25
#  ..$ 01: chr  "U" "Q" "G" "C" ...
#  ..$ 02: chr  "D" "S" "G" "W"
#  ..$ 03: chr  "R" "U" "L"
#  ...
#  ...
Community
  • 1
  • 1
akrun
  • 874,273
  • 37
  • 540
  • 662