I have a dataset with some columns and a grouping variable. I want to reduce the dataset per grouping variable, max_n
rows per grouping level. At the same time I want to keep the distribution of the other columns. What I mean by that is that I want to keep the lowest and highest values of a
and b
after the data has been filtered. That is why I use the function setorderv
below.
library(data.table)
set.seed(22)
n=20
max_n = 6
dt <- data.table("grp"=sample(c("a", "b", "c"), n, replace=T),
"a"=sample(1:10, n, replace=T),
"b"=sample(1:20, n, replace=T),
"id"=1:n)
setorderv(dt, c("grp", "a", "b"))
dt
My temporary solution, which is not very elegant or data.table-ish, goes like this:
dt_new <- data.table()
for (gr in unique(dt[["grp"]])) {
tmp <- dt[grp == gr, ]
n_tmp <- nrow(tmp)
if (n_tmp > max_n) {
tmp <- tmp[as.integer(seq(1, n_tmp, length.out=max_n)),]
}
dt_new <- rbindlist(list(dt_new, tmp))
}
Is there a more elegant way of doing this? EDIT: I want a data.table solution.
Code right now is too bulky