subset dataframe by multiple column names and return top n hits

Question

Here is my toy dataframe, the real one might have 40K-1M records and five additional columns

        animal1     version1    animal2     version2    sim             
53      20154620    TRUSEQ.v1   20104647    F250v1  0.3663569
854     20145687    TRUSEQ.v1   20105551    F250v1  0.5732854
3662    20154620    TRUSEQ.v1   20114509    F250v1  0.3374918
4063    20154620    TRUSEQ.v1   20114578    F250v1  0.3732692
4464    20154620    TRUSEQ.v1   20114595    F250v1  0.3772367
5262    20144516    TRUSEQ.v1   20115051    770k.v1 0.6034206
5663    20144516    TRUSEQ.v1   20115051    F250v1  0.6164795
5664    20145008    TRUSEQ.v1   20115051    F250v1  0.3146651
6064    20144516    TRUSEQ.v1   20115059    F250v1  0.3043295
6471    20165119    F250v1      20115096    F250v1  0.388435
9677    20154620    TRUSEQ.v1   20118095    F250v1  0.3079702
11281   20154620    TRUSEQ.v1   20134529    F250v1  0.3188631
12486   20165119    F250v1      20135032    F250v1  0.6091486
13282   20144516    TRUSEQ.v1   20135047    F250v1  0.3098507
14090   20165119    F250v1      20135072    F250v1  0.3025007
14892   20165119    F250v1      20135122    F250v1  0.345238

For each animal1, I need all rows featuring the top 3 unique animal2 values by highest sim... so my desired result is reproduced below.

        animal1     version1    animal2     version2    sim
5663    20144516    TRUSEQ.v1   20115051    F250v1  0.6164795
5262    20144516    TRUSEQ.v1   20115051    770k.v1 0.6034206
13282   20144516    TRUSEQ.v1   20135047    F250v1  0.3098507
6064    20144516    TRUSEQ.v1   20115059    F250v1  0.3043295
5664    20145008    TRUSEQ.v1   20115051    F250v1  0.3146651
854     20145687    TRUSEQ.v1   20105551    F250v1  0.5732854
4464    20154620    TRUSEQ.v1   20114595    F250v1  0.3772367
4063    20154620    TRUSEQ.v1   20114578    F250v1  0.3732692
53      20154620    TRUSEQ.v1   20104647    F250v1  0.3663569
12486   20165119    F250v1      20135032    F250v1  0.6091486
6471    20165119    F250v1      20115096    F250v1  0.388435
14892   20165119    F250v1      20135122    F250v1  0.345238

So in the subset, each animal1 might have between 1 and 20 observations but will have <=n unique values of animal2 where n=3 in this case.

I can sort the df by sim and animal1 like this

mydf <- mydf[order(-xtfrm(mydf[,"animal1"]), -mydf[,"sim"]),]

I can grab the first n observations per animal1 like this

mydf2 <- by(mydf, mydf["animal1"], head, n=1)
mydf2 <- Reduce(rbind, mydf2)

But how do I apply n to a third column, animal2 rather than the number of observations? Apologies if this is a duplicate, the answer is probably hidden in here, how to find the top N values by group or within category (groupwise) in an R data.frame but I just can't seem to stitch together a solution to my problem from the answers.

I think the data.table package has functionality like this.https://stackoverflow.com/questions/14800161/how-to-find-the-top-n-values-by-group-or-within-category-groupwise-in-an-r-dat — Balter, Jan 18 '18 at 21:08
You said you wanted *"top 3 unique animal2 values by highest sim"*, yet your expected output has 4 for `20144516`, is that a snafu or an unstated requirement? — r2evans, Jan 18 '18 at 21:10
The three unique animal2 values for animal1=20144516 are 20115051, 20135047, and 20115059. Should be right. Animal2=20115051 just has two records. — Aaron Dickey, Jan 18 '18 at 21:45

r2evans · Answer 1 · 2018-01-18T21:21:26.697

Notwithstanding my question about animal "20144516", here are a couple solutions, using dat as your sample data above (included at the bottom for reproducibility). I'm offering base-R and dplyr, though as suggested in a comment by @Balter, there is likely a straight-forward data.table methodology as well.

Base R

# ordering by animal1 is not necessary, sim is priority
dat <- dat[rev(order(dat$sim)),]
dat2 <- do.call(rbind, by(dat, list(dat$animal1, dat$animal2), head, n = 1))
# ... but we need to re-sort by sim, since the ordering is lost with `by`
dat2 <- dat2[rev(order(dat2$sim)),]
head(dat2)
#        animal1  version1  animal2 version2       sim
# 5663  20144516 TRUSEQ.v1 20115051   F250v1 0.6164795
# 12486 20165119    F250v1 20135032   F250v1 0.6091486
# 854   20145687 TRUSEQ.v1 20105551   F250v1 0.5732854
# 6471  20165119    F250v1 20115096   F250v1 0.3884350
# 4464  20154620 TRUSEQ.v1 20114595   F250v1 0.3772367
# 4063  20154620 TRUSEQ.v1 20114578   F250v1 0.3732692

This gives us the top 1 pairing of animal1 and animal2, sorted (descending) by sim. Now we effectively repeat the process with animal1 only:

dat3 <- do.call(rbind, by(dat, list(dat$animal1), head, n = 3))
dat3
#                 animal1  version1  animal2 version2       sim
# 20144516.5663  20144516 TRUSEQ.v1 20115051   F250v1 0.6164795
# 20144516.5262  20144516 TRUSEQ.v1 20115051  770k.v1 0.6034206
# 20144516.13282 20144516 TRUSEQ.v1 20135047   F250v1 0.3098507
# 20145008       20145008 TRUSEQ.v1 20115051   F250v1 0.3146651
# 20145687       20145687 TRUSEQ.v1 20105551   F250v1 0.5732854
# 20154620.4464  20154620 TRUSEQ.v1 20114595   F250v1 0.3772367
# 20154620.4063  20154620 TRUSEQ.v1 20114578   F250v1 0.3732692
# 20154620.53    20154620 TRUSEQ.v1 20104647   F250v1 0.3663569
# 20165119.12486 20165119    F250v1 20135032   F250v1 0.6091486
# 20165119.6471  20165119    F250v1 20115096   F250v1 0.3884350
# 20165119.14892 20165119    F250v1 20135122   F250v1 0.3452380

(The rownames are munged, unfortunately. If they are meaningful, I suggest you place rownames(dat) into a column and preserve it there.)

`dplyr`

You can also use dplyr.

library(dplyr)
dat %>%
  group_by(animal1, animal2) %>%
  top_n(1, wt = sim) %>%
  group_by(animal1) %>%
  top_n(3, wt = sim) %>%
  ungroup()
# # A tibble: 11 × 5
#     animal1  version1  animal2 version2       sim
#       <int>    <fctr>    <int>   <fctr>     <dbl>
# 1  20144516 TRUSEQ.v1 20115051   F250v1 0.6164795
# 2  20165119    F250v1 20135032   F250v1 0.6091486
# 3  20145687 TRUSEQ.v1 20105551   F250v1 0.5732854
# 4  20165119    F250v1 20115096   F250v1 0.3884350
# 5  20154620 TRUSEQ.v1 20114595   F250v1 0.3772367
# 6  20154620 TRUSEQ.v1 20114578   F250v1 0.3732692
# 7  20154620 TRUSEQ.v1 20104647   F250v1 0.3663569
# 8  20165119    F250v1 20135122   F250v1 0.3452380
# 9  20145008 TRUSEQ.v1 20115051   F250v1 0.3146651
# 10 20144516 TRUSEQ.v1 20135047   F250v1 0.3098507
# 11 20144516 TRUSEQ.v1 20115059   F250v1 0.3043295

PS: it can be significantly faster to use do.call(rbind, ...) in place of Reduce(rbind, ...):

library(microbenchmark)
x <- by(dat, list(dat$animal1, dat$animal2), head, n = 1)
microbenchmark(
  docall = do.call(rbind, x),
  reduce = Reduce(rbind, x)
)
# Unit: milliseconds
#    expr       min        lq      mean    median        uq       max neval
#  docall  1.418577  1.493335  1.809469  1.551136  1.731466  5.216277   100
#  reduce 11.119961 11.829614 13.302388 12.727255 13.401535 26.897520   100

This difference increases with a higher number of distinct animals. (This is because the Reduce method calls rbind once for each unique animal, whereas do.call is calling rbind only once.)

The sample data used here:

dat <- structure(list(animal1 = c(20154620L, 20145687L, 20154620L, 20154620L, 
20154620L, 20144516L, 20144516L, 20145008L, 20144516L, 20165119L, 
20154620L, 20154620L, 20165119L, 20144516L, 20165119L, 20165119L
), version1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 2L, 2L, 1L, 2L, 1L, 1L), .Label = c("F250v1", "TRUSEQ.v1"
), class = "factor"), animal2 = c(20104647L, 20105551L, 20114509L, 
20114578L, 20114595L, 20115051L, 20115051L, 20115051L, 20115059L, 
20115096L, 20118095L, 20134529L, 20135032L, 20135047L, 20135072L, 
20135122L), version2 = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("770k.v1", "F250v1"
), class = "factor"), sim = c(0.3663569, 0.5732854, 0.3374918, 
0.3732692, 0.3772367, 0.6034206, 0.6164795, 0.3146651, 0.3043295, 
0.388435, 0.3079702, 0.3188631, 0.6091486, 0.3098507, 0.3025007, 
0.345238)), .Names = c("animal1", "version1", "animal2", "version2", 
"sim"), class = "data.frame", row.names = c("53", "854", "3662", 
"4063", "4464", "5262", "5663", "5664", "6064", "6471", "9677", 
"11281", "12486", "13282", "14090", "14892"))

Jason · Accepted Answer · 2018-01-18T22:05:00.617

The code below will include multiple animal1-animal2 combinations, only if the sim value for the lower animal1-animal2 entry "would be" in the top 3. Let me know if I've misinterpreted.

library(dplyr)
selected <- dat %>% 
  arrange(animal1,animal2,desc(sim)) %>%
  group_by(animal1,animal2) %>% 
  mutate(rank=row_number()) %>%
  filter(rank==1) %>% ungroup() %>%
  group_by(animal1) %>%
  top_n(3,sim) %>% 
  summarise(threshold = min(sim))

out <- dat %>% 
  inner_join(selected, by = c("animal1"="animal1")) %>%
  filter(sim>=threshold) %>%
  arrange(animal1,animal2,desc(sim)) %>%
  select(-threshold)
> out
    animal1  version1  animal2 version2       sim
1  20144516 TRUSEQ.v1 20115051   F250v1 0.6164795
2  20144516 TRUSEQ.v1 20115051  770k.v1 0.6034206
3  20144516 TRUSEQ.v1 20115059   F250v1 0.3043295
4  20144516 TRUSEQ.v1 20135047   F250v1 0.3098507
5  20145008 TRUSEQ.v1 20115051   F250v1 0.3146651
6  20145687 TRUSEQ.v1 20105551   F250v1 0.5732854
7  20154620 TRUSEQ.v1 20104647   F250v1 0.3663569
8  20154620 TRUSEQ.v1 20114578   F250v1 0.3732692
9  20154620 TRUSEQ.v1 20114595   F250v1 0.3772367
10 20165119    F250v1 20115096   F250v1 0.3884350
11 20165119    F250v1 20135032   F250v1 0.6091486
12 20165119    F250v1 20135122   F250v1 0.3452380

As I understand this solution, it works via a threshold minimum sim value calculated for each animal1-animal2 combo where the animal1-animal2 combo is the criteria for setting n=3. Is that right? Works like a charm! — Aaron Dickey, Jan 19 '18 at 14:37
@AaronDickey - Yep, that's right. This seemed like it'd fit your intent of including redundant pairs if they happen to be above the third unique value. Hope it helps — Jason, Jan 19 '18 at 14:44

subset dataframe by multiple column names and return top n hits

2 Answers2

Base R

dplyr

`dplyr`