
I have two tables:

df_workingFile

      group   v   
1      a    110 
2      a    90 
3      b    57 
4      b    53  

df_groupIDs

  group   
1      a 
2      b  

I want to pull the maximum v value from df_workingFile for each group label into df_groupIDs:

df_groupIDs

  group   max
1     a   110
2     b    57

I have code that does this, but it's really slow with large datasets.

data.frame(df_groupIDs, maxValue = 
   sapply(df_groupIDs$group, function(newCol) 
   max(subset(df_workingFile, newCol == df_workingFile$group)$v)))

Any suggestions on how to make this more efficient?

Frank
NBC

3 Answers


With data.table, this is an "update join":

library(data.table)
setDT(df_groupIDs)
setDT(df_workingFile)

df_groupIDs[ , mx := df_workingFile[.(.SD$group), on=.(group), max(v), by=.EACHI]$V1 ]

#    group  mx
# 1:     a 110
# 2:     b  57

The .SD$group part is not idiomatic. It should just be .SD, but there's an open bug preventing that.


How it works

The syntax for a join is x[i, on=], where i is used to look up rows in x according to the rules in on=. The syntax for an update is x[, var := expr]. Here, these are combined like x[, var := x2[.SD, on=]], where .SD refers to the subset of data from x. We can equivalently write the expr as

df_workingFile[df_groupIDs, on=.(group), max(v), by=.EACHI]$V1

See the documentation in ?data.table regarding what by=.EACHI is doing. V1 is just the default name given to computed results (in this case, max(v)).
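To make the intermediate step concrete, here is the inner join expression run on its own, before $V1 is extracted (data and result values are taken from the example above):

```r
library(data.table)
df_workingFile <- data.table(group = c("a", "a", "b", "b"), v = c(110, 90, 57, 53))
df_groupIDs    <- data.table(group = c("a", "b"))

# one grouped max per row of df_groupIDs, thanks to by=.EACHI
df_workingFile[df_groupIDs, on = .(group), max(v), by = .EACHI]
#    group  V1
# 1:     a 110
# 2:     b  57
```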


Performance

I have code that does this, but it's really slow with large datasets.

This approach should be fairly efficient since (i) it assigns to the existing df_groupIDs table instead of creating a new one; (ii) it only computes the max on the groups that appear in df_groupIDs (instead of all that appear in df_workingFile); and (iii) the grouped maximum is optimized (see ?GForce).

For an example comparing timings (of update join vs dplyr, match, etc.), see merge data.frames based on year and fill in missing values.
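As a rough sanity check (not from the original post), here is a sketch of how one might time the asker's sapply approach against the update join on larger data; the microbenchmark package and the data sizes are assumptions:

```r
library(data.table)
library(microbenchmark)  # assumed to be installed

set.seed(1)
n <- 1e5
df_workingFile <- data.frame(group = sample(letters, n, replace = TRUE),
                             v     = runif(n))
df_groupIDs <- data.frame(group = letters, stringsAsFactors = FALSE)

dt_w <- as.data.table(df_workingFile)
dt_g <- as.data.table(df_groupIDs)

microbenchmark(
  # original approach: one subset + max per group
  sapply_subset = sapply(df_groupIDs$group, function(g)
    max(subset(df_workingFile, g == df_workingFile$group)$v)),
  # update join: single grouped computation, assigned in place
  update_join   = dt_g[, mx := dt_w[.(.SD$group), on = .(group),
                                    max(v), by = .EACHI]$V1],
  times = 10
)
```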

Frank

You can calculate the maximum for each group in df_workingFile first, then do a left join onto df_groupIDs.

library(dplyr)

df_max <- df_workingFile %>%
    group_by(group) %>%
    summarise(max = max(v))

df_final <- df_groupIDs %>% left_join(df_max, by = "group")

Here is the base R equivalent:

df_max <- aggregate(v ~ group, data = df_workingFile, FUN = max)

df_final <- merge(df_groupIDs, df_max, by = "group", all.x = TRUE)
www
  • I guess they want the new col assigned to the existing table df_groupIDs, like `df_groupIDs$col <- ...` in base R. – Frank Aug 16 '17 at 17:14
  • Thanks for the comment. I think `df_max` is ready to do a merge with `df_groupIDs`. I will update my answer to include that part @Frank. – www Aug 16 '17 at 17:18

Here is a base R method that uses tapply to calculate the maximum values and then adds them to the second data.frame with transform, using match to find the proper indices.

# calculate maximum values
vals <- with(df_workingFile, tapply(v, group, max))
# add new variable
transform(df_groupIDs, max = vals[match(group, names(vals))])
#   group max
# a     a 110
# b     b  57

To add the variable, a more standard notation would be

df_groupIDs$max <- vals[match(df_groupIDs$group, names(vals))]
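Since tapply returns a named vector, the match step can also be replaced by direct name-based indexing; this small variation is not part of the original answer, but behaves the same on the example data:

```r
df_workingFile <- data.frame(group = c("a", "a", "b", "b"), v = c(110, 90, 57, 53))
df_groupIDs    <- data.frame(group = c("a", "b"))

# tapply returns a named vector, so group labels can index it directly
vals <- with(df_workingFile, tapply(v, group, max))
df_groupIDs$max <- unname(vals[as.character(df_groupIDs$group)])
df_groupIDs
#   group max
# 1     a 110
# 2     b  57
```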
lmo