I would like to make my df smaller by keeping just one observation per person per date: the row with that person's biggest quantity for that date.
Here's my df:
names dates quantity
1 tom 2010-02-01 28
3 tom 2010-03-01 7
2 mary 2010-05-01 30
6 tom 2010-06-01 21
4 john 2010-07-01 45
5 mary 2010-07-01 30
8 mary 2010-07-01 28
11 tom 2010-08-01 28
7 john 2010-09-01 28
10 john 2010-09-01 30
9 john 2010-07-01 45
12 mary 2010-11-01 28
13 john 2010-12-01 7
14 john 2010-12-01 14
I do this first by finding the max quantity per person per date. This works OK, but as you can see, if a person has tied quantities they retain the same number of obs per date.
merge(df, aggregate(quantity ~ names+dates, df, max))
names dates quantity
1 john 2010-07-01 45
2 john 2010-07-01 45
3 john 2010-09-01 30
4 john 2010-12-01 14
5 mary 2010-05-01 30
6 mary 2010-07-01 30
7 mary 2010-11-01 28
8 tom 2010-02-01 28
9 tom 2010-03-01 7
10 tom 2010-06-01 21
11 tom 2010-08-01 28
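To make the tie issue reproducible, here is a minimal sketch that rebuilds the sample data and runs the same merge/aggregate step; john's two 45s on 2010-07-01 both match the group max, so 11 rows survive instead of one per group:

```r
df <- data.frame(
  names = c("tom", "tom", "mary", "tom", "john", "mary", "mary",
            "tom", "john", "john", "john", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-03-01", "2010-05-01", "2010-06-01",
                    "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01",
                    "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01",
                    "2010-12-01", "2010-12-01")),
  quantity = c(28, 7, 30, 21, 45, 30, 28, 28, 28, 30, 45, 28, 7, 14)
)

## aggregate() gives one max per (names, dates) group; merging back keeps
## EVERY original row that equals that max, so ties are duplicated
maxed <- merge(df, aggregate(quantity ~ names + dates, df, max))
nrow(maxed)  # 11 rows, although there are only 10 person/date groups
```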
So, my next step would be to take just the first obs per date (given that I have already selected the biggest quantity). I can't get the code right for this. This is what I have tried:
merge(l, aggregate(names ~ dates, l, FUN=function(z) z[1]))->m ##doesn't get rid of one obs for john
and a data.table option
l[, .SD[1], by=c(names,dates)] ##doesn't work at all -- by= needs quoted column names here, i.e. by=c("names","dates")
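For what it's worth, a data.table sketch that picks exactly one row per group in a single pass (assuming data.table is loaded): `which.max()` returns the index of the first maximum, so ties are broken automatically without a separate dedup step.

```r
library(data.table)

dt <- data.table(
  names = c("tom", "mary", "mary", "john", "john", "john"),
  dates = as.Date(c("2010-07-01", "2010-07-01", "2010-07-01",
                    "2010-07-01", "2010-07-01", "2010-12-01")),
  quantity = c(21, 30, 28, 45, 45, 14)
)

## .SD[which.max(quantity)] keeps the first row with the group maximum,
## so tied maxima (john's two 45s) collapse to a single row
res <- dt[, .SD[which.max(quantity)], by = .(names, dates)]
res  # one row per (names, dates) group
```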
I like the aggregate and data.table options as they are fast, and my df has ~100k rows.
Thank you in advance for this!
SOLUTION
I posted too fast, apologies! An easy solution to this problem is just to find duplicates and then remove them, e.g.:
merge(df, aggregate(quantity ~ names + dates, df, max)) -> toy
toy$dup <- duplicated(toy)
toy <- toy[!toy$dup, ]
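The same idea also works as a one-pass base-R alternative (a sketch, not the approach above): sort so the largest quantity comes first within each person/date, then keep only the first row of each key via `duplicated()` on the key columns.

```r
df <- data.frame(
  names    = c("john", "john", "mary", "mary"),
  dates    = c("2010-07-01", "2010-07-01", "2010-07-01", "2010-07-01"),
  quantity = c(45, 45, 30, 28)
)

## sort descending on quantity within each (names, dates) group,
## then the first occurrence of each key is that group's maximum
srt <- df[order(df$names, df$dates, -df$quantity), ]
toy <- srt[!duplicated(srt[c("names", "dates")]), ]
toy  # one row per person/date, carrying the max quantity
```

This avoids the merge step entirely, which can matter on ~100k rows.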
Here are the system times:
system.time(dt2[, max(new_quan), by = list(hai_dispense_number, date_of_claim)]->method1)
user system elapsed
20.04 0.04 20.07
system.time(aggregate(new_quan ~ hai_dispense_number+date_of_claim, dt2, max)->rpp)
user system elapsed
19.129 0.028 19.148