The OP has tried a data.table solution. Here, we benefit from grouping and updating by reference simultaneously.
library(data.table)
setDT(group)[, diff := max(pt) - pt, by = Subject][]
   Subject pt diff
1:       1  2    3
2:       1  3    2
3:       1  5    0
4:       2  2   15
5:       2  5   12
6:       2  8    9
7:       2 17    0
8:       3  3    2
9:       3  5    0
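To illustrate what "updating by reference" means here, the following minimal sketch uses a small made-up table (the name dt and its values are hypothetical, not the OP's data). It assumes that comparing the result of data.table's address() before and after the update is an adequate way to see that := adds the column to the existing object instead of creating a copy.

```r
library(data.table)

# Hypothetical toy data, not taken from the question
dt <- data.table(Subject = c(1, 1, 2), pt = c(2, 5, 3))

addr_before <- address(dt)   # memory address before the update

# Grouped computation and assignment in one step; no `dt <- ...` is needed
dt[, diff := max(pt) - pt, by = Subject]

# Expected to be TRUE: the column was added to the same object
same_object <- identical(addr_before, address(dt))
same_object

dt$diff   # 3 0 0
```

Because the assignment happens in place, the grouped result does not have to be joined back to the original data.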
Data
ID <- c(1,1,1,2,2,2,2,3,3)
Value <- c(2,3,5,2,5,8,17,3,5)
group <- data.frame(Subject=ID, pt=Value)
Benchmark
At the time of writing, 5 answers were posted, including Frank's comment on the efficiency of the data.table approach. So, I was wondering which of the five solutions was the fastest.
- r2evans
- mine
- Frank
- harelhan
- JonMinton
Some solutions modify the data.frame in place. In addition, the OP has asked for a new column called "diff". To ensure a fair comparison, all results should return group with three columns. Some answers were modified accordingly. The answer of harelhan required substantial modifications to remove the errors.
As group is modified, all benchmark runs start with a fresh copy of group with two columns.
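Why the fresh copy matters can be sketched with a small made-up input (the values below are hypothetical). The sketch assumes the same pattern as the benchmark: copy() from data.table creates an independent copy, so the in-place update only affects the copy and the two-column original stays untouched for the next run.

```r
library(data.table)

# Hypothetical toy input with two columns
group0 <- data.frame(Subject = c(1, 1, 2), pt = c(2, 5, 3))

group <- copy(group0)   # deep copy, independent of group0
setDT(group)[, diff := max(pt) - pt, by = Subject]

names(group0)   # "Subject" "pt"
names(group)    # "Subject" "pt" "diff"
```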
The benchmark is parameterized over the number of rows and the share of groups, i.e., the number of groups is a fixed fraction of the number of rows (n_grp = grp_share * n_row) so that it scales with the problem size.
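As a rough illustration of the parameter grid, the following sketch lists the number of groups for each combination of row count and group share used in the benchmark. The data.frame name grid is only chosen for this illustration; expand.grid() is base R.

```r
# All combinations of the two benchmark parameters
grid <- expand.grid(
  n_row = c(1E2, 1E4, 1E5),
  grp_share = c(0.01, 0.1, 0.5, 0.9)
)

# Number of groups implied by each combination
grid$n_grp <- grid$grp_share * grid$n_row

grid              # 12 combinations
range(grid$n_grp) # from 1 group up to 90000 groups
```

So the benchmark covers the range from a single large group to many groups that contain only one or a few rows each.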
library(data.table)
library(dplyr)
library(bench)
bm <- press(
  # n_row = c(1E2, 1E4, 1E5, 1E6),
  n_row = c(1E2, 1E4, 1E5),
  grp_share = c(0.01, 0.1, 0.5, 0.9),
  {
    n_grp <- grp_share * n_row
    set.seed(1)
    group0 <- data.frame(
      Subject = sample(n_grp, n_row, TRUE),
      pt = as.numeric(rpois(n_row, 100)))
    mark(
      r2Evans = {
        group <- copy(group0)
        group <- group %>%
          group_by(Subject) %>%
          mutate(diff = max(pt) - pt)
        group
      },
      Uwe = {
        group <- copy(group0)
        setDT(group)[, diff := max(pt) - pt, by = Subject]
        group
      },
      Frank = {
        group <- copy(group0)
        setDT(group)[, mx := max(pt), by=Subject][, diff := mx - pt][, mx := NULL]
        group
      },
      harelhan = {
        group <- copy(group0)
        max_group <- group %>% group_by(Subject) %>% summarize(max_val = max(pt))
        group <- left_join(group, max_group[, c("Subject", "max_val")], by = "Subject")
        group$diff <- group$max_val - group$pt
        group <- group %>% select(-max_val)
        group
      },
      JonMinton = {
        group <- copy(group0)
        group <- group %>%
          group_by(Subject) %>%
          mutate(max_group_val = max(pt)) %>%
          ungroup() %>%
          mutate(diff = max_group_val - pt) %>%
          select(-max_group_val)
        group
      }
    )
  }
)
ggplot2::autoplot(bm)
