
I would like to make my df smaller by taking just one observation per person per date, based on the person's biggest quantity per date.

Here's my df:

names      dates quantity
1    tom 2010-02-01       28
3    tom 2010-03-01        7
2   mary 2010-05-01       30
6    tom 2010-06-01       21
4   john 2010-07-01       45
5   mary 2010-07-01       30
8   mary 2010-07-01       28
11   tom 2010-08-01       28
7   john 2010-09-01       28
10  john 2010-09-01       30
9   john 2010-07-01       45
12  mary 2010-11-01       28
13  john 2010-12-01        7
14  john 2010-12-01       14

I do this first by finding the max quantity per person per date. This works OK but, as you can see, if a person has tied quantities on a date, all the tied observations for that date are kept.

merge(df, aggregate(quantity ~ names+dates, df, max))



 names      dates quantity
1   john 2010-07-01       45
2   john 2010-07-01       45
3   john 2010-09-01       30
4   john 2010-12-01       14
5   mary 2010-05-01       30
6   mary 2010-07-01       30
7   mary 2010-11-01       28
8    tom 2010-02-01       28
9    tom 2010-03-01        7
10   tom 2010-06-01       21
11   tom 2010-08-01       28

So, my next step would be to take just the first obs per date (given that I have already selected the biggest quantity). I can't get the code right for this. This is what I have tried:

merge(l, aggregate(names ~ dates, l, FUN=function(z) z[1]))->m  ##doesn't get rid of one obs for john

and a data.table option:

l[, .SD[1], by=c(names,dates)]  ##doesn't work at all

I like the aggregate and data.table options as they are fast, and my df has ~100k rows.
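For what it's worth, the data.table attempt seems to fail because the grouping columns are passed unquoted to `by`. A minimal sketch of a working version, assuming `l` is the merged max-per-group result from above:

 library(data.table)
 l <- as.data.table(merge(df, aggregate(quantity ~ names+dates, df, max)))
 l[, .SD[1], by = c("names", "dates")]  ## first row per name/date group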

Thank you in advance for this!

SOLUTION

I posted too fast - apologies! An easy solution to this problem is just to find duplicates and then remove them, e.g.:

merge(df, aggregate(quantity ~ names+dates, df, max)) -> toy
toy$dup <- duplicated(toy)
toy <- toy[!toy$dup, ]
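The same idea collapses to one line with `unique()`, as Mayou's answer below points out:

 toy <- unique(merge(df, aggregate(quantity ~ names+dates, df, max)))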

Here are the system times:

 system.time(dt2[, max(new_quan), by = list(hai_dispense_number, date_of_claim)]->method1)
   user  system elapsed 
  20.04    0.04   20.07 



 system.time(aggregate(new_quan ~ hai_dispense_number+date_of_claim, dt2, max)->rpp)
   user  system elapsed 
 19.129   0.028  19.148 
  • You should post your solution as an answer and accept it rather than include it in your question. – Thomas Aug 19 '13 at 18:17
  • @user2363642 There is a better way of doing this, by using `unique()`. Please refer to my answer below. – Mayou Aug 19 '13 at 18:18
  • it is both ethical and encouraged! – mac Aug 19 '13 at 18:21
  • If you can show that it is faster than all the rest, then it's "ethical". I think we're still waiting for a true data.table solution and it's likely to win. (I suspect this is a duplicate.) – IRTFM Aug 19 '13 at 18:22
  • agree with Dwin - but not about it being a duplicate. I got the original aggregate code from here: http://stackoverflow.com/questions/9718711/selecting-rows-which-contain-daily-max-value-in-r I have learned from this and applied the answers to my own problem. If I have done something erroneous, please inform me. I don't want to breach normal etiquette here. – user2363642 Aug 19 '13 at 18:24
  • I also don't understand what the merge is doing. Why not just: `aggregate(quantity ~ names+dates, df, max)`? – IRTFM Aug 19 '13 at 18:28
  • Dwin - you have highlighted a great point. `aggregate(quantity ~ names+dates, df, max)` gives exactly the output I need. No need for merge. – user2363642 Aug 19 '13 at 18:32
  • seems that the plain aggregate code works best – user2363642 Aug 19 '13 at 18:47

4 Answers


I am not sure this gives you the output you want, but it definitely takes care of the "duplicate rows":

 # Replicating your dataframe
 df <- data.frame(
   names    = c("tom", "tom", "mary", "tom", "john", "mary", "mary",
                "tom", "john", "john", "john", "mary", "john", "john"),
   dates    = c("2010-02-01", "2010-03-01", "2010-05-01", "2010-06-01",
                "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01",
                "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01",
                "2010-12-01", "2010-12-01"),
   quantity = c(28, 7, 30, 21, 45, 30, 28, 28, 28, 30, 45, 28, 7, 14))

 temp = merge(df, aggregate(quantity ~ names+dates, df, max))
 df.unique = unique(temp)
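A quick sanity check (my own verification, using the counts implied by the question's merged output):

 nrow(temp)       # 11 - john's tied 45s on 2010-07-01 appear twice
 nrow(df.unique)  # 10 - one row per name/date pair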

Here's a data.table solution:

dt[, max(quantity), by = list(names, dates)]
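One caveat: written this way the aggregated column comes back named `V1`. A minimal variant that names it explicitly in `j`:

 dt[, list(quantity = max(quantity)), by = list(names, dates)]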

Bench:

library(data.table)
library(microbenchmark)

N = 1e6

dt = data.table(names = sample(letters, N, T), dates = sample(LETTERS, N, T), quantity = rnorm(N))
df = data.frame(dt)

op = function(df) aggregate(quantity ~ names+dates, df, max) 
eddi = function(dt) dt[, max(quantity), by = list(names, dates)]

microbenchmark(op(df), eddi(dt), times = 10)
#Unit: milliseconds
#     expr      min        lq   median        uq      max neval
#   op(df) 2535.241 3025.1485 3195.078 3398.4404 3533.209    10
# eddi(dt)  148.088  162.8073  198.222  220.1217  286.058    10
  • check out system times, i'll post above so easier to see – user2363642 Aug 19 '13 at 18:44
  • @user2363642 yep, `aggregate` is plenty fast for this as well - in my tests `data.table` is a little bit faster which you're not seeing in the noise of a single run, but still I'm quite impressed by `aggregate` speed here – eddi Aug 19 '13 at 18:51
  • @user2363642 hmm upon retesting `aggregate` is much slower (I realized I was testing on a very small sample), see edit – eddi Aug 19 '13 at 18:56
  • great conversation and exchange of methods. delighted. Thank you everyone! – user2363642 Aug 19 '13 at 18:57
  • how big is your df? mine is 89000 rows – user2363642 Aug 19 '13 at 18:58
  • @user2363642 you can see the size in the bench above and can try playing around with it - at 89000 rows it's still about 20x faster; I can't think of a reason right now why `aggregate` would be even remotely close in speed for your data – eddi Aug 19 '13 at 19:02

If you are using a data.frame:

 library(plyr)
 ddply(mydata, .(names, dates), summarize, maxquantity = max(quantity))
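For reference, run against the df replicated in Mayou's answer above (with `mydata = df`), this should give one row per name/date pair:

    names      dates maxquantity
 1   john 2010-07-01          45
 2   john 2010-09-01          30
 3   john 2010-12-01          14
 4   mary 2010-05-01          30
 5   mary 2010-07-01          30
 6   mary 2010-11-01          28
 7    tom 2010-02-01          28
 8    tom 2010-03-01           7
 9    tom 2010-06-01          21
 10   tom 2010-08-01          28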
  • hi Metrics, thanks for this. does what I want in one step rather than two. My only concern is that ddply will take ages with my massive df (~100K rows). See my edit above. – user2363642 Aug 19 '13 at 18:19
## split df into one piece per name/date combination, keep the row with
## the largest quantity in each piece, then stack the pieces back together
do.call(rbind,
        lapply(split(df, df[, c("names", "dates")]),
               function(d) d[which.max(d$quantity), ]))
## note: split() also creates empty name/date combinations; which.max()
## on an empty piece selects zero rows, so rbind() silently drops them