
This is a follow-up question to "Subset by group with data.table", using the same data.table:

library(data.table)
library(plyr)  # provides the baseball dataset

bdt <- as.data.table(baseball)

# Aggregating and losing the information in the other columns
dt1 <- bdt[ , .(max_g = max(g)), by = id]
# Aggregating and keeping the information in the other columns
dt2 <- bdt[bdt[, .I[g == max(g)], by = id]$V1]

Why do dt1 and dt2 differ in their number of rows? Isn't dt2 supposed to give the same result, just without losing the information in the other columns?

andschar
  • 1
    Look at, for example, `bdt[id == "woodge01"][order(-g)]` -- that id has multiple rows that maximize `g`. The second approach identifies all such rows, while the first simply returns the maximizing value. This is the difference between max (a value) and argmax (a set of maximizers): http://math.stackexchange.com/q/312012/ – Frank Feb 02 '17 at 15:25
  • 1
    Thanks for your reply. This perfectly clarifies the difference! – andschar Feb 02 '17 at 16:48
  • 1
    Cool, you can post an answer yourself if you want. – Frank Feb 02 '17 at 16:49

1 Answer


As @Frank pointed out:

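`bdt[ , .(max_g = max(g)), by = id]` returns only the maximum value of `g` for each `id`, so it has exactly one row per `id`, while

`bdt[bdt[ , .I[g == max(g)], by = id]$V1]` returns every row whose `g` equals the maximum within its `id`. When several rows tie for the maximum, all of them are kept, which is why dt2 can have more rows than dt1.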

See "What is the difference between arg max and max?" (http://math.stackexchange.com/q/312012/) for a mathematical explanation, and try this reduced example in R:

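library(data.table)
library(plyr)  # provides the baseball dataset
bdt <- as.data.table(baseball)

# A single id that has several rows sharing the maximum of g
dt <- bdt[id == "woodge01"][order(-g)]
# max: one row holding the maximum value
dt[ , .(max = max(g)), by = id]
# argmax: all rows that attain the maximum
dt[ dt[ , .I[g == max(g)], by = id]$V1 ]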
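If the baseball data is not at hand, the same effect can be reproduced with a small made-up table. The sketch below is only an illustration: the `toy` table and its values are hypothetical, not taken from the original data. It also shows one possible way to keep the other columns while still getting a single row per id, using `which.max()`, which returns only the first maximizing position in each group.

```r
library(data.table)

# Hypothetical toy data: id "a" has two rows tied for the maximum of g
toy <- data.table(
  id   = c("a", "a", "a", "b", "b"),
  year = c(2001, 2002, 2003, 2001, 2002),
  g    = c(10, 30, 30, 5, 7)
)

# max: one value per group, hence one row per id
max_by_id <- toy[, .(max_g = max(g)), by = id]

# argmax: every row that attains its group's maximum, ties included
argmax_rows <- toy[toy[, .I[g == max(g)], by = id]$V1]

# One row per id with the other columns kept: which.max() picks
# only the first maximizing row within each group
first_max_rows <- toy[toy[, .I[which.max(g)], by = id]$V1]

nrow(max_by_id)       # 2
nrow(argmax_rows)     # 3
nrow(first_max_rows)  # 2
```

Whether dropping the tied rows is acceptable depends on the analysis; with `which.max()` the choice among tied rows is determined purely by row order.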
andschar