0

I don't know how to name the proper title; however, following is my question.
I have a data:

ID    Name    Type    Date          Amount
1     AAAA    First   2009/7/20     100
1     AAAA    First   2010/2/3      200
2     BBBB    First   2015/3/10     250
2     CCC     Second  2009/2/23     300
2     CCC     Second  2010/1/25     400
2     CCC     Third   2015/4/9      500
2     CCC     Third   2016/6/25     700

I want to remove the data that has same ID, Name, and Type; but the Date is smaller. Or you can say that keep Date is the largest.
The result is like:

ID    Name    Type    Date        Amount  
1     AAAA    First   2010/2/3    300
2     BBBB    First   2015/3/10   250
2     CCC     Second  2010/1/25   700
2     CCC     Third   2016/6/25   1200

I know I can use duplicated() to get the which observations are duplicating.

dt <- fread("
        ID    Name    Type    Date
        1     AAAA    First   2009/7/20
        1     AAAA    First   2010/2/3
        2     BBBB    First   2015/3/10
        2     CCC     Second  2009/2/23
        2     CCC     Second  2010/1/25
        2     CCC     Third   2015/4/9
        2     CCC     Third   2016/6/25
        ")
dt$Date <- as.Date(dt$Date)
dt[duplicated(ID) & duplicated(Name) & duplicated(Type)]
   ID Name   Type      Date   Amount
1:  1 AAAA  First  2010/2/3   200
2:  2  CCC Second 2010/1/25   400
3:  2  CCC  Third 2016/6/25   700

However, this is not I want. Although it removes the smaller Date, it cannot keep the third observation(ID=2, Name=BBBB, Type=First). Also, I still need to sum Amount.

How can I do?

Peter Chen
  • 1,464
  • 3
  • 21
  • 48
  • @RonakShah - `as.Date(Date)` will be a problem here - is `2010/2/3` the 3rd of Feb or the 2nd of March? You might want to put an explicit format on that variable. – thelatemail Aug 03 '17 at 01:44
  • 2
    `unique(dt, by=c("ID","Name","Type"), fromLast=TRUE)` if you've sorted. You should use a proper date type, not a string, btw. – Frank Aug 03 '17 at 01:44
  • Sorry, I edit my question. I miss some important points, – Peter Chen Aug 03 '17 at 01:46
  • 1
    Ok, then `.(last(Date), sum(Amount)), by=cols` like any other aggregation..? If so, someone could probably close this against https://stackoverflow.com/q/12064202/ which has a nice answer by Matt – Frank Aug 03 '17 at 01:47
  • In `dplyr`, `df %>% group_by(ID, Name, Type) %>% summarise(Date = max(as.Date(Date, "%Y/%m/%d")), Amount = sum(Amount))` – Ronak Shah Aug 03 '17 at 01:49
  • @Frank `last()` is a `data.table` function? – Peter Chen Aug 03 '17 at 01:50
  • 1
    Yes, and it's optimized in this case too. The `?GForce` doc explains, if you're interested. If it's not sorted, `max` works, too, of course. – Frank Aug 03 '17 at 01:51
  • I use `dt[, .(dplyr::summarise(Date = max(as.Date(Date, "%Y/%m/%d")), Amount = sum(Amount))), by = .(ID, Name, Type)]` but get error. How can I figure out it? – Peter Chen Aug 03 '17 at 01:58
  • 1
    oh. I figure it out. `dt[,.(max(Date) ,Amount = sum(Amount)), by = .(ID, Name, Type)]` is correct. – Peter Chen Aug 03 '17 at 02:05

0 Answers0