I am trying to do a group-by maximum operation in R on ~50k rows. My data looks like this:

> head(df, 10)
   group       val
1      2 0.9891907
2      2 0.8965835
3      2        NA
4      2        NA
5      3        NA
6      4 0.8681051
7      4 0.7861292
8      5 0.9110303
9      7        NA
10     7        NA

I want to create a new column maxval which has a group-wise maximum of val, based on the groups in group. I want to ignore missing values when a group has any non-missing values, and I want to return NA when the group has all missing values. So the expected result for these first few rows would be:

   group    val maxval
 1     2  0.989  0.989 # 0.989 is the max value for all of group == 2
 2     2  0.897  0.989
 3     2 NA      0.989
 4     2 NA      0.989
 5     3 NA     NA     # for group == 3, val is always missing, so return NA
 6     4  0.868  0.868
 7     4  0.786  0.868
 8     5  0.911  0.911
 9     7 NA     NA     # for group == 7, val is always missing, so return NA
10     7 NA     NA  

I attempted to do this using dplyr tools:

df %>% group_by(group) %>% mutate(maxval=max(val, na.rm=T)) %>% ungroup()

This works*, but is terribly slow (nearing 30 seconds):

> system.time(df %>% group_by(group) %>% mutate(maxval=max(val, na.rm=T)) %>% ungroup())
   user  system elapsed 
 27.021   0.093  27.171 

* Caveat: it returns -Inf instead of NA for groups where all values are missing, but that can be fixed quickly.
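
For example (a sketch, assuming the pipeline result is saved back into a variable res), the -Inf values can be converted back to NA afterwards:

res <- df %>% group_by(group) %>% mutate(maxval=max(val, na.rm=T)) %>% ungroup()
res$maxval[is.infinite(res$maxval)] <- NA  # all-NA groups came back as -Inf; set them to NA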

If I omit the na.rm = T from the call to max, the operation is instantaneous (0.06 seconds). But the output is then incorrect, because any group containing even one missing value gets NA in the maxval column.
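
For reference, the fast-but-incorrect variant described here is just the same pipeline without na.rm (shown as a sketch for comparison only):

system.time(df %>% group_by(group) %>% mutate(maxval=max(val)) %>% ungroup())
# elapsed time is around 0.06 s, but any group containing an NA gets maxval = NA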

I thought the slowness might be due to the warnings produced when max is called on a group with no non-missing values, but using suppressWarnings does not improve the timing:

# following here: https://stackoverflow.com/q/46239615/13386979
suppressWarnings(df %>% group_by(group) %>% mutate(maxval=max(val, na.rm=T)) %>% ungroup())

I have found a solution that I will post below, but I don't really understand why it works, and I would like to know whether there is a better approach. I am not very familiar with R, so let me know how you would do this (or if I am missing something obvious). I am open to using other non-base packages. Thanks!


Code for constructing the data:

set.seed(13)

# create data
n <- 50000
df <- data.frame(group = sample(1:n, size=n, replace=T),
                 val = runif(n))

# sort
df <- df[order(df$group), ]
rownames(df) <- NULL

# sparsify
df$val <- ifelse(df$val < .75, NA, df$val)
Tom

2 Answers

Using if(){} we can bypass the max calculation if the entire vector is NA. This is a massive speed-up:

fmax = function(x, na.rm = TRUE) {
  if(all(is.na(x))) return(x[1])
  return(max(x, na.rm = na.rm))
}

system.time(df %>%
  group_by(group) %>%
  mutate(maxval = fmax(val)))
# user  system elapsed 
# 0.20    0.01    0.22 
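
Called directly on small vectors (a quick sanity check, not part of the timing above), fmax behaves as intended:

fmax(c(NA, NA))        # NA: the max() call is skipped entirely
fmax(c(NA, 0.5, 0.9))  # 0.9: NAs ignored as usual
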
Gregor Thomas
  • Thanks, I thought this was the same as my answer but does seem to be faster! – Tom Aug 17 '21 at 16:29
  • @Tom yes, your answer is slower because [ifelse calculates the full `if` and full `else` answers, and then picks the right one to use](https://stackoverflow.com/q/16275149/903061). My answer is faster because the code wrapped in `if()` is not run if the condition isn't met--and it's the slow part. Often `ifelse` is still faster due to vectorization, but here the condition is a single `TRUE` or `FALSE` for a whole group, so we don't need the vectorized `ifelse`. – Gregor Thomas Aug 17 '21 at 18:50

I grabbed this custom maximum function from this post. As desired, it will return NA when all the values of the group are NA:

> my.max <- function(x) ifelse( !all(is.na(x)), max(x, na.rm=T), NA)
> df %>% group_by(group) %>% mutate(maxval=my.max(val)) %>% ungroup()
# A tibble: 50,000 x 3
   group    val maxval
   <int>  <dbl>  <dbl>
 1     2  0.989  0.989
 2     2  0.897  0.989
 3     2 NA      0.989
 4     2 NA      0.989
 5     3 NA     NA    
 6     4  0.868  0.868
 7     4  0.786  0.868
 8     5  0.911  0.911
 9     7 NA     NA    
10     7 NA     NA    
# ... with 49,990 more rows

> system.time(df %>% group_by(group) %>% mutate(maxval=my.max(val)) %>% ungroup())
   user  system elapsed 
   0.14    0.00    0.14 

This is ~2-3x slower than the regular max with na.rm = F, but still many times faster than with na.rm = T (and it gives the correct output).

Tom