1

I'm trying to summarize a data frame, while grouping by a variable. My problem is that when doing such summarizing procedure, I lose other columns that I need.

Consider the following data:

df <- 
  tibble::tribble(
    ~id, ~year, ~my_value,
    1,   2010,  2,
    1,   2013,  2,
    1,   2014,  2,
    2,   2010,  4,
    2,   2012,  3,
    2,   2014,  4,
    2,   2015,  2,
    3,   2015,  3,
    3,   2010,  3,
    3,   2011,  3
  )

I want to group by id in order to collapse my_value to a single value. I use the following algorithm:

  1. IF all values of my_value are identical, then simply return the first value, i.e, my_value[1].
  2. ELSE return the smallest value, i.e., min(my_value).

So I wrote a small function that does it:

my_func <- function(x) {
  if (var(x) == 0) {
    return(x[1])
  }
  # else:
  min(x)
}

And now I can use either dplyr or data.table to summarize by id:

library(dplyr)
library(data.table)

# dplyr
df %>%
  group_by(id) %>%
  summarise(my_min_val = my_func(my_value))
#> # A tibble: 3 x 2
#>      id my_min_val
#>   <dbl>      <dbl>
#> 1     1          2
#> 2     2          2
#> 3     3          3

# data.table
setDT(df)[, .(my_min_val = my_func(my_value)), by = "id"]
#>    id my_min_val
#> 1:  1          2
#> 2:  2          2
#> 3:  3          3

So far so good. My problem is that I lost the year value. I want the respective year value for each chosen my_value.

My desired output should look like:

# desired output
desired_output <- 
  tribble(~id, ~my_min_val, ~year,
          1,   2,           2010,  # because for id 1, var(my_value) is 0, and hence my_value[1] corresponds to year 2010
          2,   2,           2015,  # because for id 2, var(my_value) is not 0, and hence min(my_value) (which is 2) corresponds to year 2015
          3,   3,           2015)  # because for id 3, var(my_value) is 0, hence my_value[1] corresponds to year 2015

I especially seek a data.table solution because my real data is very large (over 1 million rows) and with many groups. Thus efficiency is important. Thanks!

Emman
  • 3,695
  • 2
  • 20
  • 44
  • 1
    Related: [Extract row corresponding to minimum value of a variable by group](https://stackoverflow.com/questions/24070714/extract-row-corresponding-to-minimum-value-of-a-variable-by-group) – Henrik Dec 06 '21 at 20:48

1 Answers1

2

We may use the condition in slice

library(dplyr)
my_func <- function(x) if(var(x) == 0) 1 else which.min(x)
df %>% 
   group_by(id) %>% 
   slice(my_func(my_value)) %>%
   ungroup

-output

# A tibble: 3 × 3
     id  year my_value
  <dbl> <dbl>    <dbl>
1     1  2010        2
2     2  2015        2
3     3  2015        3

Or using data.table

library(data.table)
setDT(df)[df[, .I[my_func(my_value)], id]$V1]
   id year my_value
1:  1 2010        2
2:  2 2015        2
3:  3 2015        3

Or with slice_min and with_ties = FALSE

df %>%
    group_by(id) %>% 
    slice_min(n = 1, order_by = my_value, with_ties = FALSE)  %>%
    ungroup

-output

# A tibble: 3 × 3
     id  year my_value
  <dbl> <dbl>    <dbl>
1     1  2010        2
2     2  2015        2
3     3  2015        3
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks. It is important to me to get the *first* element when `var(my_value) == 0` (rather than just the `min()`) *because* I need the respective year of the first element in the scenario where all elements are identical. – Emman Dec 06 '21 at 20:19
  • @Emman what about the update – akrun Dec 06 '21 at 20:21
  • Yes! The first seems perfect. I'm not sure about the second option. Does the 2nd way follow the same condition? – Emman Dec 06 '21 at 20:24
  • @Emman I guess the `var` is 0 is when all elements are the same right? there may be some difference – akrun Dec 06 '21 at 20:24
  • yes I guess so. Do you happen to know a `data.table` equivalent (to the first method)? – Emman Dec 06 '21 at 20:27
  • @Emman Yes, instead of `var(my_value) == 0` you may also use `n_distinct(my_value) == 1` because `var` have some limitations i.e. numeric only type – akrun Dec 06 '21 at 20:31
  • @Emman updated the post with `data.table` – akrun Dec 06 '21 at 20:33
  • 1
    Wow you're incredible. Thank you very much. – Emman Dec 06 '21 at 20:34
  • 1
    As for `var()` being limited, yes I know. I think I'll go with `data.table::uniqueN(x) == 1` – Emman Dec 06 '21 at 20:36