With data.table, this is an "update join":
library(data.table)
setDT(df_groupIDs)
setDT(df_workingFile)
df_groupIDs[ , mx := df_workingFile[.(.SD$group), on=.(group), max(v), by=.EACHI]$V1 ]
# group mx
# 1: a 110
# 2: b 57
The .SD$group
part is not idiomatic. It should just be .SD
, but there's an open bug preventing that.
How it works
The syntax for a join is x[i, on=]
, where i
is used to look up rows in x
according to the rules in on=
. The syntax for an update is x[, var := expr]
. Here, these are combined like x[, var := x2[.SD, on=]]
, where .SD
refers to the subset of data from x
. We can equivalently write the expr
as
df_workingFile[df_groupIDs, on=.(group), max(v), by=.EACHI]$V1
See the documentation in ?data.table
regarding what by=.EACHI
is doing. V1
is just the default name given to computed results (in this case, max(v)
).
Performance
I have code that does this, but it's really slow with large datasets.
This approach should be fairly efficient since (i) it assigns to the existing df_groupIDs
table instead of creating a new one; (ii) it only computes the max on the groups that appear in df_groupIDs
(instead of all that appear in df_workingFile
); and (iii) the grouped maximum is optimized (see ?GForce
).
For an example comparing timings (of update join vs dplyr, match, etc), see merge data.frames based on year and fill in missing values .