How to find the average of n largest values by group and add as a new column in r data table

Question

I read a lot similar questions before asking a new one but here I am. I have a long data table that consist of plot, dbh, etc. An example of my data like this:

structure(list(plot = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), dbh = c(18L, 14L, 
13L, 20L, 20L, 15L, 9L, 12L, 22L, 21L, 14L, 14L, 13L, 18L, 24L, 
19L, 13L, 15L, 17L, 22L, 11L)), class = "data.frame", row.names = c(NA, 
-21L))

What I want to do is find the average of 5 largest values by group (plot) and add this values as a new column to the same data table. I'm expecting to get the following result.

structure(list(plot = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), dbh = c(18L, 14L, 
13L, 20L, 20L, 15L, 9L, 12L, 22L, 21L, 14L, 14L, 13L, 18L, 24L, 
19L, 13L, 15L, 17L, 22L, 11L), dom = c(17.4, 17.4, 17.4, 17.4, 
17.4, 17.4, 17.4, 17.4, 21.6, 21.6, 21.6, 21.6, 21.6, 21.6, 21.6, 
21.6, 21.6, 21.6, 21.6, 21.6, 21.6)), class = "data.frame", row.names = c(NA, 
-21L))

I will be appreciate for your help. Thanks.

PS: I tried many different codes within different packages (data.table, dplyr, etc..) however couldn't able to make it so I won't give any mwe that doesn't work.

How should be the ties treated? – tmfmnk Oct 08 '20 at 09:41 — tmfmnk, Oct 08 '20 at 09:41

Ronak Shah · Accepted Answer · 2020-10-08T09:47:02.083

You can use head/tail to get top 5 values :

df$dom <- with(df, ave(dbh, plot, FUN = function(x) mean(tail(sort(x), 5))))
#same as doing 1:5
#df$dom <- with(df, ave(dbh, plot, FUN = function(x) 
                    mean(sort(x, decreasing = TRUE)[1:5])))

Or using dplyr :

library(dplyr)
df %>% group_by(plot) %>% mutate(dom = mean(tail(sort(dbh), 5)))

and data.table :

library(data.table)
setDT(df)[, dom := mean(tail(sort(dbh), 5)), plot]
df

#    plot dbh  dom
# 1:    1  18 17.4
# 2:    1  14 17.4
# 3:    1  13 17.4
# 4:    1  20 17.4
# 5:    1  20 17.4
# 6:    1  15 17.4
# 7:    1   9 17.4
# 8:    1  12 17.4
# 9:    2  22 21.6
#10:    2  21 21.6
#11:    2  14 21.6
#12:    2  14 21.6
#13:    2  13 21.6
#14:    2  18 21.6
#15:    2  24 21.6
#16:    2  19 21.6
#17:    2  13 21.6
#18:    2  15 21.6
#19:    2  17 21.6
#20:    2  22 21.6
#21:    2  11 21.6
#    plot dbh  dom

dplyr also has slice_max function (previously top_n) to get top n values in each group.

df %>%
  group_by(plot) %>%
  slice_max(dbh, n = 5) %>%
  summarise(dom = mean(dbh)) %>%
  left_join(df, by = 'plot')

wow! I've been trying for two days. Thank you! – Onur Alkan Oct 08 '20 at 09:48 — Onur Alkan, Oct 08 '20 at 09:48

How to find the average of n largest values by group and add as a new column in r data table

1 Answers1