select top n values by group with n depending on other value in data frame

Question

I'm quite new to R and coding in general. Your help would be highly appreciated :)

I'm trying to select the top n values by group, with n depending on another value (called factor in the following) in my data frame. The selected values should then be summarised by group to calculate the mean (d100). My goal is to get one d100 value per group.

(Background: In forestry there is an indicator called d100 which is the mean diameter of the 100 thickest trees per hectare. If the size of the sampling area is smaller than 1 ha you need to select accordingly fewer trees to calculate d100. That's what the factor is for.)

First I tried putting the factor into my data frame as its own column. Then I thought it might help to have something like a "lookup table", because R said that n must be a single number. But I don't know how to create a lookup function. (See the last part of the sample code.) Or maybe summarising df$factor before using it would do the trick?

Sample data:

(I indicated expressions where I'm not sure how to code them in R like this: 'I dont know how')

# creating sample data
library(tidyverse)

df <- data.frame(group = c(rep(1, each = 5), rep(2, each = 8), rep(3, each = 10)),
                 BHD = c(rnorm(23, mean = 30, sd = 5)),
                 factor = c(rep(pi*(15/100)^2, each = 5), rep(pi*(20/100)^2, each = 8), rep(pi*(25/100)^2, each = 10))
                )

# group by ID, then select top_n values of df$BHD with n depending on value of df$factor
df %>% 
  group_by(group) %>% 
  slice_max(
    BHD, 
    n = 100*df$factor, 
    with_ties = F) %>% 
  summarise(d100 = mean('sliced values per group'))

# other thought: having a "lookup-table" for the factor like this:
lt <- data.frame(group = c(1, 2, 3),
                 factor = c(pi*(15/100)^2, pi*(20/100)^2, pi*(25/100)^2))

# then
df %>% 
  group_by(group) %>% 
  slice_max(
    BHD, 
    n = 100*lt$factor 'where lt$group == df$group', 
    with_ties = F) %>% 
  summarise(d100 = mean('sliced values per group'))

I already found this answer to a problem which seems similar to mine, but it didn't quite help.

Would something like this help you https://stackoverflow.com/questions/12925063/numbering-rows-within-groups-in-a-data-frame/50906379#50906379 — hannes101, Apr 16 '21 at 09:50

score 0 · Answer 1 · answered Apr 16 '21 at 09:51

Since all the factor values are the same within each group, you can pick any one of them, for example the first with first(factor), and use it to compute n.

library(dplyr)

df %>%
  group_by(group) %>%
  # n is evaluated within each group, so first(factor) is that group's factor
  top_n(n = 100 * first(factor), wt = BHD) %>%
  ungroup()

#   group   BHD factor
#   <dbl> <dbl>  <dbl>
# 1     1  25.8 0.0707
# 2     1  24.6 0.0707
# 3     1  27.6 0.0707
# 4     1  28.3 0.0707
# 5     1  29.2 0.0707
# 6     2  28.8 0.126 
# 7     2  39.5 0.126 
# 8     2  23.1 0.126 
# 9     2  27.9 0.126 
#10     2  31.7 0.126 
# … with 13 more rows
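
The output above still has one row per selected tree, and with the question's sample data every tree is kept because each group has fewer rows than 100 * factor. To get one d100 value per group, the selection can be followed by summarise(). The following is only a sketch under a few assumptions that are not in the question: larger group sizes (10, 15 and 25 trees) so that the selection actually drops rows, a fixed seed, n rounded down with floor(), and a plain filter() on the within-group rank instead of top_n(). The second half shows how a separate lookup table like the question's lt could be joined in first.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible

# Assumption: more trees per plot than in the question's sample,
# so that the selection actually drops some rows.
plot_sizes <- c(10, 15, 25)
df <- data.frame(
  group  = rep(1:3, times = plot_sizes),
  BHD    = rnorm(sum(plot_sizes), mean = 30, sd = 5),
  factor = rep(pi * (c(15, 20, 25) / 100)^2, times = plot_sizes)
)

d100 <- df %>%
  group_by(group) %>%
  # rank trees from thickest (1) downwards within each group and keep
  # the first n; rounding n down with floor() is an assumption
  filter(row_number(desc(BHD)) <= floor(100 * first(factor))) %>%
  summarise(n_trees = n(), d100 = mean(BHD))

# Variant with a separate lookup table like the question's `lt`:
# join the factor onto the tree data first, then proceed as above.
lt <- distinct(df, group, factor)

d100_lt <- df %>%
  select(-factor) %>%
  left_join(lt, by = "group") %>%
  group_by(group) %>%
  filter(row_number(desc(BHD)) <= floor(100 * first(factor))) %>%
  summarise(n_trees = n(), d100 = mean(BHD))
```

With these plot radii, n_trees comes out as 7, 12 and 19. Whether n should be rounded down, up, or to the nearest integer depends on the inventory protocol, so floor() may need to be swapped for round() or ceiling().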
