Keep observations based on other columns

Question

This question is the extension from here.
If my data has a column called Remark:

ID    Name    Type    Date          Amount   Remark
1     AAAA    First   2009/7/20     100      Not want
1     AAAA    First   2010/2/3      200      want ya
2     BBBB    First   2015/3/10     250      
2     CCC     Second  2009/2/23     300      good
2     CCC     Second  2010/1/25     400      OK Right123
2     CCC     Third   2015/4/9      500      
2     CCC     Third   2016/6/25     700      Stackoverflow is awesome

I want my result to keep it when the Date is max.
First, if i don't consider column Remark, I can use max() to get this:

dt[,.(Date = max(Date), Amount = sum(Amount)), by = .(ID, Name, Type)]
   ID Name   Type       Date  Amount
1:  1 AAAA  First 2010-02-03     300
2:  2 BBBB  First 2015-03-10     250
3:  2  CCC Second 2010-01-25     700
4:  2  CCC  Third 2016-06-25    1200

However, how can I keep Remark.

   ID Name   Type       Date  Amount      Remark
1:  1 AAAA  First 2010-02-03     300      want ya
2:  2 BBBB  First 2015-03-10     250      
3:  2  CCC Second 2010-01-25     700      OK Right123
4:  2  CCC  Third 2016-06-25    1200      Stackoverflow is awesome

Here is my data:

dt <- fread("
        ID    Name    Type    Date          Amount   Remark
        1     AAAA    First   2009/7/20     100      Not.want
        1     AAAA    First   2010/2/3      200      want.ya
        2     BBBB    First   2015/3/10     250      
        2     CCC     Second  2009/2/23     300      good
        2     CCC     Second  2010/1/25     400      OK.Right123
        2     CCC     Third   2015/4/9      500      
        2     CCC     Third   2016/6/25     700      Stackoverflow.is.awesome
        ")
dt$Date <- as.Date(dt$Date)

See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 We should be able to copy-paste your code in a new R session and see the same example data. I'm still seeing non-Dates there...also, I get an error when running that `fread`. — Frank, Aug 03 '17 at 03:08

akrun · Accepted Answer · 2017-08-03T03:22:09.430

We can use a join

setcolorder(dt[, setdiff(names(dt), "Amount"), with = FALSE][dt[,  .(Date = max(Date), 
                 Amount = sum(Amount)),
       by = .(ID, Name, Type)], on = .(ID, Name, Type, Date)], names(dt))[]
#   ID Name   Type       Date Amount                   Remark
#1:  1 AAAA  First 2010-02-03    300                  want ya
#2:  2 BBBB  First 2015-03-10    250                         
#3:  2  CCC Second 2010-01-25    700              OK Right123
#4:  2  CCC  Third 2016-06-25   1200 Stackoverflow is awesome

Or without a join

dt1 <- dt[, c(Amount = sum(.SD[["Amount"]]), .SD[which.max(Date), 
  setdiff(names(.SD), "Amount"), with = FALSE]), .(ID, Name, Type)]

setcolorder(dt1, names(dt))
dt1
#   ID Name   Type       Date Amount                   Remark
#1:  1 AAAA  First 2010-02-03    300                  want ya
#2:  2 BBBB  First 2015-03-10    250                         
#3:  2  CCC Second 2010-01-25    700              OK Right123
#4:  2  CCC  Third 2016-06-25   1200 Stackoverflow is awesome

If there are more number of 'Amount' columns to be summed

nm1 <- grep("Amount\\d*", names(dt), value = TRUE)
setcolorder(dt[, setdiff(names(dt), nm1), with = FALSE][dt[, c(Date= max(Date),
       lapply(.SD, sum)), by = .(ID, Name, Type), .SDcols = nm1],
      on = .(ID, Name, Type, Date)], names(dt))[]

If I have more than 3 columns need to be summed(`Amount`, `Amount1`, `Amount2`), how can i do? — Peter Chen, Aug 03 '17 at 03:15
@PeterChen In that case, use `dt[, c(Date = max(Date), lapply(.SD, sum)), by = .(ID, Name, Type), .SDcols = AmountCols]` within the second chain of first solution and make changes for the 'Amount' columns with `setdiff` — akrun, Aug 03 '17 at 03:17

score 1 · Answer 2 · answered Aug 03 '17 at 03:18

> df
   ID Name   Type       Date Amount                   Remark
1:  1 AAAA  First 03-02-2010    200                  want ya
2:  2  CCC  Third 09-04-2015    500                         
3:  2 BBBB  First 10-03-2015    250                         
4:  1 AAAA  First 20-07-2009    100                 Not want
5:  2  CCC Second 23-02-2009    300                     good
6:  2  CCC Second 25-01-2010    400              OK Right123
7:  2  CCC  Third 25-06-2016    700 Stackoverflow is awesome

> df2=df[,.(Date = max(Date), Amount = sum(Amount)), by = .(ID, Name, Type)]
> df2
   ID Name   Type       Date Amount
1:  2 BBBB  First 10-03-2015    250
2:  1 AAAA  First 20-07-2009    300
3:  2  CCC Second 25-01-2010    700
4:  2  CCC  Third 25-06-2016   1200


> df[df2,]
   ID Name   Type       Date Amount                   Remark i.ID i.Name i.Type i.Amount
1:  2 BBBB  First 10-03-2015    250                             2   BBBB  First      250
2:  1 AAAA  First 20-07-2009    100                 Not want    1   AAAA  First      300
3:  2  CCC Second 25-01-2010    400              OK Right123    2    CCC Second      700
4:  2  CCC  Third 25-06-2016    700 Stackoverflow is awesome    2    CCC  Third     1200


> df3=df[df2,c("ID","Name","Type","Date","Remark","i.Amount")]
> df3
   ID Name   Type       Date                   Remark i.Amount
1:  2 BBBB  First 10-03-2015                               250
2:  1 AAAA  First 20-07-2009                 Not want      300
3:  2  CCC Second 25-01-2010              OK Right123      700
4:  2  CCC  Third 25-06-2016 Stackoverflow is awesome     1200

your answer has some problems. not correct. But the way is right. — Peter Chen, Aug 03 '17 at 03:29

Keep observations based on other columns

2 Answers2