I am trying to do a groupby maximum operation in R on ~50k rows. My data looks like so:
> head(df, 10)
group val
1 2 0.9891907
2 2 0.8965835
3 2 NA
4 2 NA
5 3 NA
6 4 0.8681051
7 4 0.7861292
8 5 0.9110303
9 7 NA
10 7 NA
I want to create a new column maxval
which has a group-wise maximum of val
, based on the groups in group
. I want to ignore missing values when a group has any non-missing values, and I want to return NA
when the group has all missing values. So the expected result for these first few rows would be:
group val maxval
1 2 0.989 0.989 # 0.989 is the max value for all of group == 2
2 2 0.897 0.989
3 2 NA 0.989
4 2 NA 0.989
5 3 NA NA # for group == 3, val is always missing, so return NA
6 4 0.868 0.868
7 4 0.786 0.868
8 5 0.911 0.911
9 7 NA NA # for group == 7, val is always missing, so return NA
10 7 NA NA
I attempted to do this using dplyr
tools:
df %>% group_by(group) %>% mutate(maxval=max(val, na.rm=T)) %>% ungroup()
This works*, but is terribly slow (nearing 30 seconds):
> system.time(df %>% group_by(group) %>% mutate(maxval=max(val, na.rm=T)) %>% ungroup())
user system elapsed
27.021 0.093 27.171
* Caveat because it returns -Inf
instead of NA
, but this can be fixed quickly.
If I omit the na.rm = T
from the call to max
, the operation is instantaneous (0.06 seconds). But the output is then incorrect, because groups with only-partial missing values return NA
in the maxval
column.
I thought the slowness might be due to the warnings produced from taking max
on an empty sequence, but using suppressWarnings
does not improve the timing:
# following here: https://stackoverflow.com/q/46239615/13386979
suppressWarnings(df %>% group_by(group) %>% mutate(maxval=max(val, na.rm=T)) %>% ungroup())
I have found a solution that I will post, but I don't really understand why it works and I also wanted to know if there is a better solution. I am not very familiar with R, so let me know how you would do this (or if I am missing something obvious). I am open to using other non-base packages. Thanks!
Constructing code for the data:
set.seed(13)
# create data
n <- 50000
df <- data.frame(group = sample(1:n, size=n, replace=T),
val = runif(n))
# sort
df <- df[order(df$group), ]
rownames(df) <- NULL
# sparsify
df$val <- ifelse(df$val < .75, NA, df$val)