Extract top N values base on multiple variables

Question

I known how to extract the top Nth value using plyr (ddply) as shown below but that is based on sorting and extracting based on one variable... I want to extract based on two variables. how do I do that?

ex.

rats<- read.table("http://vincentarelbundock.github.io/Rdatasets/csv/KMsurv/rats.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
top = ddply(rats, .(litter), function(x) data.frame(reads=tail(sort(x$time),1)))

the script above works great to extract the top value for each litter but what if I want the top value for each litter by treatment(rx column)?

Any help greatly appreciated.

also it looks like all litters have one 1 and two 0 rx, so you could just sort and index like: `o <- rats[with(rats, order(litter, rx, time)), ]; o[rep(c(FALSE, TRUE, TRUE), 50), ]` — rawr, Apr 03 '15 at 18:42
@rawr that is the most direct answer given the existing code - post as an answer so I can vote for it? — Sam Firke, Apr 03 '15 at 18:42
I tried rwar answer above before I posted and it didn't work. That is why I was looking for a different approach. — Ina.Quest, Apr 03 '15 at 19:37

Sam Firke · Accepted Answer · 2015-04-03T20:24:46.400

3

This produces the top time value for each combination of litter and rx:

library(dplyr)
rats %>%
  group_by(litter, rx) %>% # group
  arrange(-time) %>% # sort
  slice(1) # take top 1 row per subgroup


Source: local data frame [100 x 5]
Groups: litter, rx

    X litter rx time status
1   3      1  0  104      0
2   1      1  1  101      0
3   6      2  0  104      0
4   4      2  1  104      0
5   8      3  0  104      0
6   7      3  1  104      0
7  11      4  0   97      0
8  10      4  1   77      0
9  14      5  0  104      0
10 13      5  1   89      0
.. ..    ... ..  ...    ...

Your question is titled "Top N" - if you want more than the top 1, adjust that argument in the slice() call.

Handling ties: If there are ties within subgroups (as in this data set) and you want Top-N where N > 1, this will return records with identical values for time. For the Top N unique values, you could add distinct(time) %>% as a line after the group_by to filter for unique values.

edited Apr 03 '15 at 20:24

answered Apr 03 '15 at 18:34

Sam Firke

21,571
9
87
105

This is going to be silly but how do I see the rest of the results? I tried it with my dataset but it stops at 10 entries, just like above – Ina.Quest Apr 03 '15 at 18:57
`dplyr` will return a `tbl_df` and only show those first 10 rows, intentionally; some options for seeing all of the rows are here: http://stackoverflow.com/questions/23188900/view-entire-dataframe-when-wrapped-in-tbl-df. If you don't want to mess with the `tbl_df` at all you can wrap that `dplyr` code block in a `data.frame()` call. – Sam Firke Apr 03 '15 at 18:59
Ok I did the data.frame wrap but now looking at the data I can see that there are double entries for some litters. For example if you look at the results from the dataset above rows 11-12 are the same (and so are 18-19, etc.) - which is occurring because the entries have the same time value... how do you drop identical results? – Ina.Quest Apr 03 '15 at 19:13
I just changed the answer from using `top_n()` (which will return ties as you noticed) to a sort-and-slice approach using `arrange()` to sort in decreasing order of time and then `slice()` to select the top row for each subgroup. You could keep the `top_n()` and just slice after that line to break the ties; I slightly prefer this because if you want top 2, say, you only need to adjust in one place instead of two. – Sam Firke Apr 03 '15 at 19:22

Extract top N values base on multiple variables

1 Answers1