I have timestamped data. Occasionally, because of the resolution of the timestamp (e.g., to the nearest millisecond), I get multiple updates at a single timestamp. I wish to group by timestamp, aggregate the data, and then return the last row in each group.
I find that the obvious approach in dplyr takes a very long time, especially compared to data.table. While this may be partly due to how much faster data.table is when the number of groups exceeds 100K (see the benchmark section here), I would like to know whether there is a way to make this operation faster in dplyr (or even in data.table) by exploiting the fact that groups with more than one row are very sparse.
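To make the operation concrete, here is a tiny toy illustration of the pattern I mean (the ts/val columns are made up for illustration and are not the benchmark data below):
library(dplyr)
# toy frame where timestamp ts = 2 has two updates
toy <- data.frame(ts = c(1, 2, 2, 3), val = c(10, 1, 2, 5))
toy %>%
  group_by(ts) %>%
  mutate(val = cumsum(val)) %>%
  filter(row_number() == n()) %>%   # keep only the last row per timestamp
  ungroup()
# ts = 2 keeps a single row with val = 1 + 2 = 3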
Example data (10 million rows, only 1000 groups with more than 1 row of data):
library(dplyr)

# 10 million groups, one row each
tmp_df <- data.frame(grp = seq_len(1e7))
set.seed(0)

# sample 1000 groups and triplicate them, tagging each row with change = 1, 2, 3
tmp_df_dup <-
  tmp_df %>%
  sample_frac(1e-4)
tmp_df_dup <-
  tmp_df_dup[rep(seq_len(nrow(tmp_df_dup)), 3), , drop = FALSE] %>%
  arrange(grp) %>%
  group_by(grp) %>%
  mutate(change = seq(3)) %>%
  ungroup()

# join back; groups that were not sampled get change = NA
tmp_df <-
  tmp_df %>%
  left_join(tmp_df_dup, by = 'grp')
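A quick optional check confirms the structure I described (the 1000 sampled groups each appear three times, everything else once):
# sanity check on sparsity: 1000 triplicated groups -> 2000 duplicate rows
sum(duplicated(tmp_df$grp))    # should be 2000
length(unique(tmp_df$grp))     # should be 1e7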
The following operation takes 7 minutes on my machine:
time_now <- Sys.time()
tmp_result <-
  tmp_df %>%
  group_by(grp) %>%
  mutate(change = cumsum(change)) %>%
  filter(row_number() == n()) %>%   # keep only the last row of each group
  ungroup()
print(Sys.time() - time_now)
# Time difference of 7.340796 mins
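As a quick check (not part of the timing), each of the 1000 duplicated groups should end with change = 1 + 2 + 3 = 6, and every other group should have change = NA:
tmp_result %>%
  filter(!is.na(change)) %>%
  count(change)
# expect one row: change = 6, n = 1000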
In contrast, the equivalent data.table operation takes less than 10 seconds:
library(data.table)

time_now <- Sys.time()
setDT(tmp_df)
# running sum within each group, then keep the last row of each group
tmp_result_dt <-
  tmp_df[, .(change = cumsum(change)), by = grp]
tmp_result_dt <-
  tmp_result_dt[tmp_result_dt[, .I[.N], by = grp]$V1]
print(Sys.time() - time_now)
# Time difference of 9.033687 secs
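For reference, this is the kind of sparsity-exploiting approach I have in mind, as a rough, unbenchmarked sketch (it assumes the data.table code above has already been run, so data.table is loaded and tmp_df is a data.table):
# sketch: aggregate only the groups that actually have more than one row
dup_grps <- tmp_df[, .N, by = grp][N > 1L, grp]   # the ~1000 duplicated groups
single   <- tmp_df[!grp %in% dup_grps]            # already one row per group, untouched
multi    <- tmp_df[grp %in% dup_grps, .(change = cumsum(change)), by = grp]
multi    <- multi[multi[, .I[.N], by = grp]$V1]   # last row of each duplicated group
tmp_result_sparse <- rbindlist(list(single, multi))[order(grp)]
I have not verified whether this is actually faster; it is only meant to illustrate the idea.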