Selecting the row subgroup with the max value of a variable

Question

so I have a data frame:

ID:    YearMon:   Var: Count:
1      012007      H            1
1      012007      D            2
1      022007                   NA
1      032007      H            1
2      012007      H            1
2      022007                   Na
2      022007      D            1
2      032007                   NA

How would i go about getting just the max value for each unique ID for a certain YearMon? Ideally it would return:

1      012007      D            2
1      022007                   NA
1      032007      H            1
2      012007      H            1
2      022007      D            1
2      032007                   NA

Badger · Accepted Answer · 2015-10-16T18:33:21.830

1

Using plyr this should be easily achieved. This will filter by ID and YearMon and return the max value along with the ID and YearMon in a data frame.

library(plyr)

ddply( dat1 , .(ID,YearMon)  ,function(x) {
Count = max( x$Count )
data.frame( Count=Count , Var=x[x$Count == Count,"Var"] )
})

In order to return all Columns:

df[ is.na( df$Count ) , "Count" ] <- -9999

df2 <- ddply(df, .(ID,YearMon) , function(x){

Count = max( x$Count )
index = which( x$Count == max( x$Count ))
y <- x[ index ,]                                
data.frame( y )

})

df2[ df2$Count == -9999, "Count" ] <- NA

This will return your indexing values back to NA as well.

edited Oct 16 '15 at 18:33

answered Oct 16 '15 at 18:05

Badger

1,043
10
25

When I run this replacing dat1 with my data set, I am getting the error object 'Count' not found although the variable name in my dataTable is Count. Is there something else I should be replacing as well? – swagbag Oct 16 '15 at 18:14
Yea sorry I fixed that in my code but not my answer, should function now. – Badger Oct 16 '15 at 18:15
Awesome worked perfectly. How would I go about returning an entire row from my old dataframe instead of creating a new one with only the 3 variables? – swagbag Oct 16 '15 at 18:20
I'm seeing below that you are interested in the NA's, I would suggest changing the NA's to a value in order to keep them. I will adjust above to show this. – Badger Oct 16 '15 at 18:30

score 1 · Answer 2 · answered Oct 16 '15 at 18:17

1

Using data.table, if you have a data table called dt, you can first calculate the max of Count by group, and then just keep the rows where Count is equal to the max for that group:

newdt <- dt[, max.count := max(Count), by=.(ID, YearMon)][Count==max.count,.(ID, YearMon, Var, Count)]

answered Oct 16 '15 at 18:17

Tchotchke

3,061
3
22
37

Hi, this worked really well, but is there a way to keep the rows with NA as well? – swagbag Oct 16 '15 at 18:25
@swagbag By definition when you group by ID and Year Month and take the max NA's will be removed. If you want to keep them you need to create a third NA variable to group by. – CCurtis Oct 16 '15 at 19:09

score 1 · Answer 3 · answered Oct 16 '15 at 18:20

1

library(dplyr)

dt %>%
  group_by(ID, YearMon) %>%
  slice(Count %>% which.max)

answered Oct 16 '15 at 18:20

bramtayl

4,004
2
11
18

I found this very helpful, thank you. Can you please explain your code? Specifically the last line involving `slice`. – Jash Shah Apr 24 '17 at 11:13

score 0 · Answer 4 · answered Oct 16 '15 at 18:42

Lets not forget about aggregate!

  #####Clean up data. You need to change your grouping variables to factors and data needs to be numeric####

dat1$Var.[dat1$Var.==1]=""
dat1$Count.<-as.numeric(dat1$Count.)
dat1$ID.<-as.factor(dat1$ID.)
dat1$YearMon.<-as.factor(dat1$YearMon.)
dat1<-dat1[,-3] ###Lets get rid of the Var column as you're not using it.


aggregate(. ~ ID.+YearMon.,data = dat1,FUN=max ) #### Use aggregate. Simple and short code
  ID. YearMon. Count.
1   1    12007      2
2   2    12007      1
3   2    22007      1
4   1    32007      1

akrun · Answer 5 · 2015-10-17T06:03:41.510

Another data.table option using if/else. We convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'ID', 'YearMon', if all of the 'Count' values are 'NA' we return the Subset of Data.table (.SD) or else we get the index of maximum value of 'Count' and subset the data.table (.SD[which.max(Count)]).

library(data.table)
setDT(df1)[, if(all(is.na(Count))) .SD else .SD[which.max(Count)],.(ID, YearMon)]
 #   ID YearMon Var Count
 #1:  1   12007   D     2
 #2:  1   22007        NA
 #3:  1   32007   H     1
 #4:  2   12007   H     1
 #5:  2   22007   D     1
 #6:  2   32007        NA

Or another option would be to concatenate the index from which.max and the rows which have all 'NA' for 'Count' grouped by the variables, get the row index (.I) and use that to subset the 'data.table'.

setDT(df1)[df1[, .I[c(which.max(Count), all(is.na(Count)))], .(ID, YearMon)]$V1]
#   ID YearMon Var Count
#1:  1   12007   D     2
#2:  1   22007        NA
#3:  1   32007   H     1
#4:  2   12007   H     1
#5:  2   22007   D     1
#6:  2   32007        NA

Or we replace the NA by a very small number, use which.max and subset

setDT(df1)[, .SD[which.max(replace(Count, is.na(Count),-Inf ))], .(ID, YearMon)]

data

df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
 YearMon = c(12007L, 
12007L, 22007L, 32007L, 12007L, 22007L, 22007L, 32007L), Var = c("H", 
"D", "", "H", "H", "", "D", ""), Count = c(1L, 2L, NA, 1L, 1L, 
NA, 1L, NA)), .Names = c("ID", "YearMon", "Var", "Count"),
class = "data.frame", row.names = c(NA, -8L))

Selecting the row subgroup with the max value of a variable

5 Answers5

data