2

I found this thread Find rows in dataframe with maximum values grouped by values in another column where one of the solution has been discussed. I am using this solution to recursively find the row index with maximum quantity. However, my solution is very ugly--very procedural instead of vectorized.

Here's my dummy data:

dput(Data)

structure(list(Order_Year = c(1999, 1999, 1999, 1999, 1999, 1999, 
1999, 2000, 2000, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2002, 
2002, 2002, 2002), Ship_Year = c(1997, 1998, 1999, 2000, 2001, 
2002, NA, 1997, NA, 1997, 1998, 1999, 2000, 2001, 2002, NA, 1997, 
1998, 1999, 2000), Yen = c(202598.2, 0, 0, 0, 0, 0, 2365901.62, 
627206.75998, 531087.43, 122167.02, 143855.55, 0, 0, 0, 0, 53650.389998, 
17708416.3198, 98196.4, 31389, 0), Units = c(37, 1, 8, 5, 8, 
8, 730, 99, 91, 195, 259, 4, 1, 3, 3, 53, 3844, 142, 63, 27)), .Names = c("Order_Year", 
"Ship_Year", "Yen", "Units"), row.names = c(NA, 20L), class = "data.frame")

I want to find out the Ship_Year for which Yen and Units are maximum for a given Order_Year.

Here's what I tried:

a<-do.call("rbind", by(Data, Data$Order_Year, function(x) x[which.max(x$Yen), ]))
rownames(a)<-NULL
a$Yen<-NULL
a$Units<-NULL
#a has Ship_Year for which Yen is max for a given Order_Year
names(a)[2]<-"by.Yen" 
#Now I'd find max year by units
b<-do.call("rbind", by(Data, Data$Order_Year, function(x) x[which.max(x$Units), ]))
rownames(b)<-NULL
b$Yen<-NULL
b$Units<-NULL
#b has Ship_Year for which Units is max for a given Order_Year
names(b)[2]<-"by.Qty"
c<-a %>% left_join(b)

The expected output is:

c
  Order_Year by.Yen by.Qty
1       1999     NA     NA
2       2000   1997   1997
3       2001   1998   1998
4       2002   1997   1997

While I got the expected output, the method above is very clunky. Is there a better way to handle this?

Community
  • 1
  • 1
watchtower
  • 4,140
  • 14
  • 50
  • 92

3 Answers3

4

which.max works well with dplyr grouping:

library(dplyr)

Data %>% group_by(Order_Year) %>% 
    summarise(by.Yen = Ship_Year[which.max(Yen)], 
              by.Units = Ship_Year[which.max(Units)])

## # A tibble: 4 × 3
##   Order_Year by.Yen by.Units
##        <dbl>  <dbl>    <dbl>
## 1       1999     NA       NA
## 2       2000   1997     1997
## 3       2001   1998     1998
## 4       2002   1997     1997
alistaire
  • 42,459
  • 4
  • 77
  • 117
3

using Base R

a1 <- with(df1,
           by(data    = df1,
              INDICES = Order_Year, 
              FUN     = function(x) list(Yen   = x$Ship_Year[which.max(x$Yen)],
                                         Units = x$Ship_Year[which.max(x$Units)])))

do.call("rbind", lapply(a1, function(x) data.frame(x)))
#       Yen Units
# 1999   NA    NA
# 2000 1997  1997
# 2001 1998  1998
# 2002 1997  1997

Data:

df1 <- structure(list(Order_Year = c(1999, 1999, 1999, 1999, 1999, 1999, 1999,
                                     2000, 2000, 2001, 2001, 2001, 2001, 2001,
                                     2001, 2001, 2002, 2002, 2002, 2002),
                      Ship_Year = c(1997, 1998, 1999, 2000, 2001, 2002, NA, 
                                    1997, NA, 1997, 1998, 1999, 2000, 2001, 
                                    2002, NA, 1997, 1998, 1999, 2000),
                      Yen = c(202598.2, 0, 0, 0, 0, 0, 2365901.62, 627206.75998, 
                              531087.43, 122167.02, 143855.55, 0, 0, 0, 0,
                              53650.389998, 17708416.3198, 98196.4, 31389, 0), 
                      Units = c(37, 1, 8, 5, 8, 8, 730, 99, 91, 195, 259, 4,
                                1, 3, 3, 53, 3844, 142, 63, 27)), 
                 .Names = c("Order_Year", "Ship_Year", "Yen", "Units"), 
                 row.names = c(NA, 20L),
                 class = "data.frame")
Sathish
  • 12,453
  • 3
  • 41
  • 59
  • The edited answer gives the correct solution. Previous aggregate function and the output from it will not work all times and its output matching with your expected results is coincidental. Hope this helps without any package – Sathish Jan 16 '17 at 05:37
2

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(Data)), grouped by 'Order_Year', we get the index of the maximum value of 'Yen', 'Units' with match, subset the corresponding values of 'Ship_Year' based on that index to return the summarised output

library(data.table)
setDT(Data)[,.(by.Yen = Ship_Year[match(max(Yen), Yen)],
        by.Units = Ship_Year[match(max(Units), Units)]) , Order_Year]
#   Order_Year by.Yen by.Units
#1:       1999     NA       NA
#2:       2000   1997     1997
#3:       2001   1998     1998
#4:       2002   1997     1997

If there are many columns, instead of doing this separately, we can specify the columns of interest in .SDcols, grouped by 'Order_Year', loop through the Subset of Data.table (.SD) to get the index of maximum value, unlist the list output, subset the 'Ship_Year' based on that index, convert to list (as.list) and set the names of the columns to 'by.Yen' and 'by.Units'

setnames(setDT(Data)[,  as.list(Ship_Year[unlist(lapply(.SD, 
  which.max))]), Order_Year, .SDcols = c("Yen", "Units")], 
                2:3, c("by.Yen", "by.Units"))[]
#    Order_Year by.Yen by.Units
#1:       1999     NA       NA
#2:       2000   1997     1997
#3:       2001   1998     1998
#4:       2002   1997     1997
akrun
  • 874,273
  • 37
  • 540
  • 662
  • @akrun- thanks so much for you help. Do you mind explaining the steps? I tried executing your code, but couldn't really understand it. – watchtower Jan 16 '17 at 03:11
  • 1
    @watchtower I updated with the explanation. Hope it helps – akrun Jan 16 '17 at 04:17
  • thanks for your help. I thought of going with Alistaire's answer as an answer because of its simplicity. I hope you understand. – watchtower Jan 16 '17 at 05:00