Subset number of rows per group based on a value in another dataframe

Question

I have read through similar questions, but mine is slightly different. I have a dataframe (df1) with more than 3 million rows, 1874 species (scientific_name) and total value.

I also have another dataframe (df2) which provides the number of rows I want to keep per species (in total around 2 million rows).

What I would like to do is subset/filter df1 as per the number of rows specified in df2, keeping only those rows with the highest total value. E.g. let imagine that Cypraeidae in df2 n.at.70 = 1104 (rather than 1), so I would like the resulting df to retain 1104 rows starting with the highest total value to the 1104th highest total value for that species (scientific_name).

I have been unable to achieve this for one species, let alone come up with an effective 'apply' or 'for' loop, so any help would be greatly appreciated, I am relatively new to R.

[See here](https://stackoverflow.com/q/5963269/5325862) on making a reproducible example that is easier for folks to help with, including a workable sample of data, not pictures of it. It's not entirely clear what you're trying to do, mostly because the 2 pictures of data you included don't seem to recreate the issue — camille, Oct 05 '20 at 14:50

Ben · Answer 1 · 2020-10-05T19:48:09.203

1

It sounds like you are manipulating large data sets - a data.table solution might be faster and preferable.

One approach is to merge in the n.at.70 value to df1, put in descending order of total, and then show the top n.at.70 results using head.

library(data.table)

setDT(df1)
setDT(df2)

setkey(df1, scientific_name)
setkey(df2, scientific_name)

df2[df1, on = "scientific_name"][
  order(-total), head(.SD, n.at.70[1]), by = scientific_name]

A tidyverse equivalent might be the following:

library(tidyverse)

df1 %>%
  left_join(df2) %>%
  group_by(scientific_name) %>%
  arrange(scientific_name, desc(total)) %>%
  slice(1:n.at.70[1])

edited Oct 05 '20 at 19:48

answered Oct 05 '20 at 19:25

Ben

28,684
5
23
45

Thank you very much, the tidyverse equivalent is excellent. I am just reading about slice, very handy! – TJeff Oct 05 '20 at 23:23
You can also check out `top_n` as well. – Ben Oct 06 '20 at 14:22

Subset number of rows per group based on a value in another dataframe

1 Answers1