I'm trying to use a function that calculates some statistics by group in a data.table as follows:

minmax <- function(vec) {
  c(min(vec), max(vec))
}
library(data.table)
iris <- as.data.table(iris)
iris[, c('Min', 'Max') := minmax(Petal.Length), by = Species]

The result should be the min and max Petal.Length by Species and have as many rows as there are Species. That is, the same result as the code below:

merge(
  iris[, .(Min = min(Petal.Length)), Species],
  iris[, .(Max = max(Petal.Length)), Species],
  by = 'Species'
)

      Species Min Max
1:     setosa 1.0 1.9
2: versicolor 3.0 5.1
3:  virginica 4.5 6.9

Note: in my own code I want to do this in one go, not using merge().

r2evans
Bob Jansen
  • Related: [Create aggregate output data.table from function returning multiple output](https://stackoverflow.com/questions/25425121/create-aggregate-output-data-table-from-function-returning-multiple-output) – Henrik Dec 28 '21 at 18:29
  • Thank you @Henrik, I struggled momentarily to find the dupe (and then gave up looking). – r2evans Dec 28 '21 at 18:32
  • I don't mind closing this question as a dupe but I think your answer definitely has merit compared to the original one. – Bob Jansen Dec 28 '21 at 18:33

1 Answer

There are a few things here:

  1. The := operator adds or updates columns in an existing data.table by reference, so it will not summarize as you've demonstrated: DT[, a := b] always leaves the number of rows unchanged. I think this is not what you need.

  2. You can compute several summary columns in a single j expression and do away with the merge (perhaps this is what you meant by doing it "in one go"):

    iris[, .(Min = min(Petal.Length), Max = max(Petal.Length)), by = .(Species)]
    #       Species   Min   Max
    #        <fctr> <num> <num>
    # 1:     setosa   1.0   1.9
    # 2: versicolor   3.0   5.1
    # 3:  virginica   4.5   6.9
    
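  3. But finally, you are asking how to use your function to get this. A first attempt returns the vector in long form, one row per value; wrapping the result in as.list() turns each value into its own column, and setNames() (or names set inside the function) supplies the column names: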

    minmax <- function(vec) c(min(vec), max(vec))
    iris[, minmax(Petal.Length), by = .(Species)]
    #       Species    V1
    #        <fctr> <num>
    # 1:     setosa   1.0
    # 2:     setosa   1.9
    # 3: versicolor   3.0
    # 4: versicolor   5.1
    # 5:  virginica   4.5
    # 6:  virginica   6.9
    
    iris[, as.list(minmax(Petal.Length)), by = .(Species)]
    #       Species    V1    V2
    #        <fctr> <num> <num>
    # 1:     setosa   1.0   1.9
    # 2: versicolor   3.0   5.1
    # 3:  virginica   4.5   6.9
    
    iris[, setNames(as.list(minmax(Petal.Length)), c("Min", "Max")), by = .(Species)]
    #       Species   Min   Max
    #        <fctr> <num> <num>
    # 1:     setosa   1.0   1.9
    # 2: versicolor   3.0   5.1
    # 3:  virginica   4.5   6.9
    
    minmax <- function(vec) c(Min = min(vec), Max = max(vec))
    iris[, as.list(minmax(Petal.Length)), by = .(Species)]
    #       Species   Min   Max
    #        <fctr> <num> <num>
    # 1:     setosa   1.0   1.9
    # 2: versicolor   3.0   5.1
    # 3:  virginica   4.5   6.9
    

    Alternatively, change the function itself to return a list (optionally named), so that as.list() is no longer needed:

    minmax <- function(vec) list(min(vec), max(vec))
    iris[, minmax(Petal.Length), by = .(Species)]
    #       Species    V1    V2
    #        <fctr> <num> <num>
    # 1:     setosa   1.0   1.9
    # 2: versicolor   3.0   5.1
    # 3:  virginica   4.5   6.9
    iris[, setNames(minmax(Petal.Length), c("Min", "Max")), by = .(Species)]
    #       Species   Min   Max
    #        <fctr> <num> <num>
    # 1:     setosa   1.0   1.9
    # 2: versicolor   3.0   5.1
    # 3:  virginica   4.5   6.9
    
    minmax <- function(vec) list(Min = min(vec), Max = max(vec))
    iris[, minmax(Petal.Length), by = .(Species)]
    #       Species   Min   Max
    #        <fctr> <num> <num>
    # 1:     setosa   1.0   1.9
    # 2: versicolor   3.0   5.1
    # 3:  virginica   4.5   6.9
    
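To make the contrast between points 1 and 3 concrete, here is a minimal sketch, assuming the list-returning minmax() from the last variant above. The names dt and res are introduced here purely for illustration. It shows that := with by adds the two columns while keeping every original row, whereas the same function used without := collapses the data to one row per group.

```r
library(data.table)

# Work on a fresh copy (hypothetical name `dt`) so the built-in iris stays untouched.
dt <- as.data.table(iris)

# List-returning version of the function, as in the last variant above.
minmax <- function(vec) list(Min = min(vec), Max = max(vec))

# := adds columns by reference; each group's single value is recycled over its rows.
dt[, c("Min", "Max") := minmax(Petal.Length), by = Species]
nrow(dt)   # same number of rows as before the assignment

# Without :=, j returns a list per group, so the result has one row per Species.
res <- dt[, minmax(Petal.Length), by = Species]
nrow(res)
```

If the goal is a summary table, the second form is the one to use; the first is only appropriate when the group-level values should sit alongside every original row.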
r2evans
  • This will work for me, thanks! It's not possible to specify the column names inside the `[`? – Bob Jansen Dec 28 '21 at 18:11
  • There are several options in this answer that allow you to set names on the return value given the function does not name them itself. See the use of `setNames(as.list(...))`, recently edited in. – r2evans Dec 28 '21 at 18:12
  • The key for both *summarizing* and *naming* is to produce a `list` (`as.list`) and use `setNames` without actual `:=`-assignment. – r2evans Dec 28 '21 at 18:23
  • Yes, now I see. I was stuck on trying to use a vector. Using a vector was not a great idea in any case because it can't handle multiple types. – Bob Jansen Dec 28 '21 at 18:25
  • @BobJansen Just remember the basic property of `j`; straight from the help text: "As long as `j` returns a `list`, each element of the list becomes a column in the resulting `data.table`" Cheers – Henrik Dec 28 '21 at 18:48