
From the original sample dataset:

id <- c('1','1','2', '2', '3', '3', '3')
month <- c('6', '6', '3', '3', '4', '4', '4')
iso <- c('MEX', 'USA', 'CRI', 'SPA', 'CHN', 'MEX', 'SPA')
value <- c('1550', '1550', '384', '115', '1100', '1100', '1100')
original <- data.frame(id, month, iso, value)

I want to end up with only 1 observation for each id-month pair. The rule to follow is:

  1. Choose the maximum value for each id-month pair.
  2. If several observations within the same id-month pair tie for the maximum value, I want to choose one of those rows at random.

Therefore, the new sample dataset will look like this:

id <- c('1', '2', '3')
month <- c('6', '3', '4')
iso <- c('USA', 'CRI','MEX')
value <- c('1550', '384', '1100')
selection_criteria <- c('random','max_value','random')
new <- data.frame(id, month, iso, value, selection_criteria)

I have tried to run the following code:

new <- original %>% group_by(id, month) %>%
  filter(value == max(value))

However, it does not do the trick of selecting one row at random when more than one observation (for the same id-month pair) has the maximum value.

My intention is to automate the process given the large size of my dataset.

Any clue?

Thank you.


3 Answers


Try this

set.seed(2021)

new <- original %>% group_by(id, month) %>%
  slice_max(as.numeric(value)) %>% sample_n(1)

> new
# A tibble: 3 x 4
# Groups:   id, month [3]
  id    month iso   value
  <chr> <chr> <chr> <chr>
1 1     6     MEX   1550 
2 2     3     CRI   384  
3 3     4     MEX   1100

slice_max keeps every row whose value equals the group maximum (note the as.numeric conversion so the comparison is numeric, not lexicographic). sample_n(1) then reduces each group to a single row, chosen at random.
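A slightly different route along the same lines (a sketch, assuming dplyr >= 1.0.0) is to shuffle the rows within each group with slice_sample(prop = 1) and then let slice_max(with_ties = FALSE) keep only the first maximum, which after the shuffle is a random one among ties:

```r
library(dplyr)

id    <- c('1', '1', '2', '2', '3', '3', '3')
month <- c('6', '6', '3', '3', '4', '4', '4')
iso   <- c('MEX', 'USA', 'CRI', 'SPA', 'CHN', 'MEX', 'SPA')
value <- c('1550', '1550', '384', '115', '1100', '1100', '1100')
original <- data.frame(id, month, iso, value)

set.seed(2021)
new <- original %>%
  group_by(id, month) %>%
  slice_sample(prop = 1) %>%                # shuffle rows within each group
  slice_max(as.numeric(value), n = 1,
            with_ties = FALSE) %>%          # keep the first (now random) maximum
  ungroup()
```

Since slice_max uses a stable sort, tied rows keep their shuffled order, so the single row it returns per group is a random pick among the tied maxima.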


Three things:

  1. which.max is the more canonical way of matching the max value, instead of val == max(val); while unlikely with this data, if your data has differences in the small-decimal range, then floating-point equality can be a problem (cf. Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754); I'm going to recommend a different approach entirely ...

  2. rank will tell you which values are the highest/lowest; rank 1 goes to the *smallest* value, so to select the maximum we rank the negated values and look for a rank of 1. In the presence of ties, it has several ties.method options (abbreviated to ties below via partial matching), one of which is "random".

    rank(c(1,1), ties = "first")
    # [1] 1 2
    rank(c(1,1), ties = "last")
    # [1] 2 1
    rank(c(1,1), ties = "average")
    # [1] 1.5 1.5
    rank(c(1,1), ties = "random")
    # [1] 1 2
    rank(c(1,1), ties = "random")
    # [1] 2 1
    
  3. You're looking for max of a string, which will not always return the maximum numeric value from among the strings. For example, max(c("9", "11")) returns "9", since it uses lexicographic sorting. If you want the maximum numeric value, then we need to convert (at least temporarily) to numbers. If your intent truly is to sort alphabetically, and you want 9 to be greater than 11, then remove the as.numeric.
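A quick console check illustrates the difference:

```r
max(c("9", "11"))               # character comparison is lexicographic: "9" > "11"
max(as.numeric(c("9", "11")))   # numeric comparison gives the intended answer: 11
```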

set.seed(2020)
original %>%
  group_by(id, month) %>%
  filter(rank(-as.numeric(value), ties = "random") == 1L)
# # A tibble: 3 x 4
# # Groups:   id, month [3]
#   id    month iso   value
#   <chr> <chr> <chr> <chr>
# 1 1     6     USA   1550 
# 2 2     3     CRI   384  
# 3 3     4     MEX   1100 

A different seed can pick different rows for the tied id-month pairs (ids 1 and 3), while id 2 always returns its unique maximum:

set.seed(2021)
original %>%
  group_by(id, month) %>%
  filter(rank(-as.numeric(value), ties = "random") == 1L)
# # A tibble: 3 x 4
# # Groups:   id, month [3]
#   id    month iso   value
#   <chr> <chr> <chr> <chr>
# 1 1     6     MEX   1550 
# 2 2     3     CRI   384  
# 3 3     4     CHN   1100 

Side Note

@AnilGoyal's answer suggests a more dplyr-idiomatic solution, using slice_max and sample_n. If you prefer to stay within the tidyverse with verbs that are more declarative, then perhaps that's a good way to go (since readability and maintainability are important).

Your sample data does not suggest a large dataset, in which case benchmarking is not important. If I'm incorrect and performance does matter, the dplyr-only solution is a little slower.

# pre-group the data so that it isn't in the benchmark comparison;
# also, `as.numeric` is used in both, so do that ahead of time
original_grouped <- mutate(original, value_num = as.numeric(value)) %>%
  group_by(id, month)

bench::mark(
  base = filter(original_grouped, rank(-value_num, ties = "random") == 1L),
  dplyr = sample_n(slice_max(original_grouped, value_num), 1),
  check = FALSE)
# # A tibble: 2 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                  time           gc                
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                  <list>         <list>            
# 1 base         1.07ms   1.16ms      815.    9.34KB     4.18   390     2      478ms <NULL> <Rprofmem[,3] [8 x 3]>  <bch:tm [392]> <tibble [392 x 3]>
# 2 dplyr        2.25ms   2.45ms      378.   10.66KB     6.44   176     3      466ms <NULL> <Rprofmem[,3] [11 x 3]> <bch:tm [179]> <tibble [179 x 3]>

Though that advantage is decreased with larger data:

original_big <- bind_rows(replicate(10000, original, simplify = FALSE))
original_big_grouped <- mutate(original_big, value_num = as.numeric(value)) %>%
  group_by(id, month)
bench::mark(
  base = filter(original_big_grouped, rank(-value_num, ties = "random") == 1L),
  dplyr = sample_n(slice_max(original_big_grouped, value_num), 1),
  check = FALSE)
# # A tibble: 2 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                   time          gc               
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                   <list>        <list>           
# 1 base          6.9ms   7.47ms      128.    4.82MB     16.0    48     6      375ms <NULL> <Rprofmem[,3] [50 x 3]>  <bch:tm [54]> <tibble [54 x 3]>
# 2 dplyr        7.61ms   8.23ms      119.   11.65MB     71.1    25    15      211ms <NULL> <Rprofmem[,3] [106 x 3]> <bch:tm [40]> <tibble [40 x 3]>

Using data.table. Note that value is a character column, so convert it before taking the maximum; also, which.max returns the position of the first maximum, so ties are not broken at random:

library(data.table)
setDT(original)[, .SD[which.max(as.numeric(value))], .(id, month)]
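To also break ties at random within data.table, one option (a sketch, using a hypothetical value_num helper column) is to sample one index from all positions that attain the group maximum:

```r
library(data.table)

id    <- c('1', '1', '2', '2', '3', '3', '3')
month <- c('6', '6', '3', '3', '4', '4', '4')
iso   <- c('MEX', 'USA', 'CRI', 'SPA', 'CHN', 'MEX', 'SPA')
value <- c('1550', '1550', '384', '115', '1100', '1100', '1100')
original <- data.frame(id, month, iso, value)

setDT(original)[, value_num := as.numeric(value)]
set.seed(2021)
new <- original[, {
  idx <- which(value_num == max(value_num))
  # sample.int over the length of idx avoids the sample(n, 1) trap,
  # where a single index n would be treated as sample(1:n, 1)
  .SD[idx[sample.int(length(idx), 1L)]]
}, by = .(id, month)]
```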