0

I have read through similar questions, but mine is slightly different. I have a dataframe (df1) with more than 3 million rows, 1874 species (scientific_name) and total value.

enter image description here

I also have another dataframe (df2) which provides the number of rows I want to keep per species (in total around 2 million rows).

enter image description here

What I would like to do is subset/filter df1 as per the number of rows specified in df2, keeping only those rows with the highest total value. E.g. let imagine that Cypraeidae in df2 n.at.70 = 1104 (rather than 1), so I would like the resulting df to retain 1104 rows starting with the highest total value to the 1104th highest total value for that species (scientific_name).

I have been unable to achieve this for one species, let alone come up with an effective 'apply' or 'for' loop, so any help would be greatly appreciated, I am relatively new to R.

camille
  • 16,432
  • 18
  • 38
  • 60
TJeff
  • 29
  • 8
  • [See here](https://stackoverflow.com/q/5963269/5325862) on making a reproducible example that is easier for folks to help with, including a workable sample of data, not pictures of it. It's not entirely clear what you're trying to do, mostly because the 2 pictures of data you included don't seem to recreate the issue – camille Oct 05 '20 at 14:50

1 Answers1

1

It sounds like you are manipulating large data sets - a data.table solution might be faster and preferable.

One approach is to merge in the n.at.70 value to df1, put in descending order of total, and then show the top n.at.70 results using head.

library(data.table)

setDT(df1)
setDT(df2)

setkey(df1, scientific_name)
setkey(df2, scientific_name)

df2[df1, on = "scientific_name"][
  order(-total), head(.SD, n.at.70[1]), by = scientific_name]

A tidyverse equivalent might be the following:

library(tidyverse)

df1 %>%
  left_join(df2) %>%
  group_by(scientific_name) %>%
  arrange(scientific_name, desc(total)) %>%
  slice(1:n.at.70[1])
Ben
  • 28,684
  • 5
  • 23
  • 45