Subset by group with data.table compared to aggregate a data.table

Question

This is a follow-up question to "Subset by group with data.table", using the same data.table:

library(data.table)

# baseball is assumed to be the dataset shipped with the plyr package
data(baseball, package = "plyr")
bdt <- as.data.table(baseball)

# Aggregating and losing information on other columns
dt1 <- bdt[ , .(max_g = max(g)), by = id]
# Aggregating and keeping information on other columns
dt2 <- bdt[bdt[, .I[g == max(g)], by = id]$V1]

Why do dt1 and dt2 differ in their number of rows? Isn't dt2 supposed to give the same result, just without losing the respective information in the other columns?

Look at, for example, `bdt[id == "woodge01"][order(-g)]` -- that id has multiple rows that maximize `g`. The second approach identifies all such rows, while the first simply returns the maximizing value. This is the difference between max (a value) and argmax (a set of maximizers): http://math.stackexchange.com/q/312012/ — Frank, Feb 02 '17 at 15:25
Thanks for your reply. This perfectly clarifies the difference! — andschar, Feb 02 '17 at 16:48

score 4 · Accepted Answer · edited Apr 13 '17 at 12:19

As @Frank pointed out:

bdt[ , .(max_g = max(g)), by = id] provides you with the maximum value, while

bdt[bdt[ , .I[g == max(g)], by = id]$V1] identifies all rows that have this maximum.

See "What is the difference between arg max and max?" for a mathematical explanation, and try this slimmed-down version in R:

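library(data.table)
# baseball is assumed to be the dataset shipped with the plyr package
data(baseball, package = "plyr")
bdt <- as.data.table(baseball)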

dt <- bdt[id == "woodge01"][order(-g)]
dt[ , .(max = max(g)), by = id]
dt[ dt[ , .I[g == max(g)], by = id]$V1 ]
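
The same effect can be reproduced without the baseball data. The following is only a minimal sketch on a made-up table (the name `toy` and its values are hypothetical, chosen so that id "a" has a tie at its maximum). It also shows `which.max()`, which returns only the first maximizer and may be an option if exactly one row per id is wanted:

```r
library(data.table)

# Hypothetical toy data: id "a" reaches its maximum g twice
toy <- data.table(id = c("a", "a", "a", "b", "b"),
                  g  = c(10, 10, 3, 7, 2),
                  yr = c(2001, 2002, 2003, 2001, 2002))

# max: one value per group, hence one row per id
agg <- toy[ , .(max_g = max(g)), by = id]

# argmax: every row that attains the group maximum, so ties add rows
all_max <- toy[toy[ , .I[g == max(g)], by = id]$V1]

# which.max() keeps only the first maximizer, hence one row per id again
first_max <- toy[toy[ , .I[which.max(g)], by = id]$V1]

nrow(agg)        # 2
nrow(all_max)    # 3
nrow(first_max)  # 2
```

Which of the tied rows to keep is a modelling decision; `which.max()` simply takes the first one in the current row order.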
