I'm using data.table to subset the rows of my data.table based on the frequency of a variable, keeping only the rows whose value in a given column occurs more than a threshold number of times.
The code below works. I like that everything is calculated on the fly, without an intermediate assignment cluttering the namespace, but nesting dt inside another dt call seems un-data.table-like. Is there a more efficient method that improves elegance and/or performance?
Perhaps an approach where the function (.N, table, length or otherwise) is in the j argument and the row subsetting based on these values is in the i argument of a single data.table call?
Reference for the example
library(data.table)
dt <- data.table(mtcars)
print(table(dt[, cyl]))
# 4 6 8
# 11 7 14
data.table code
Keeping just the rows where the value of cyl occurs more than 10 times (4- and 8-cylinder cars in this case):
library('data.table')
dt <- data.table(mtcars)
dt[cyl %in% dt[,.N, by=cyl][N>10, cyl],]
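
To make the single-call idea concrete, here is a minimal sketch of one shape it might take. It relies on the data.table behaviour that a group whose j expression returns NULL is dropped from the result. It puts the .N test in j with by rather than in i, so it is not exactly the i/j split described above, and I have not benchmarked it against the nested version. The names res and nested are only for illustration.

```r
library(data.table)

dt <- data.table(mtcars)

# Group by cyl. Within each group, .N is the group's row count and
# .SD holds the group's remaining columns. When the condition fails,
# `if` returns NULL and that group is dropped from the result.
res <- dt[, if (.N > 10) .SD, by = cyl]

# The nested version from above, kept for comparison
nested <- dt[cyl %in% dt[, .N, by = cyl][N > 10, cyl], ]

# Both keep the same number of rows
nrow(res) == nrow(nested)
```

Note that the output differs in layout from the nested version: cyl becomes the first column and the rows come back grouped by cyl rather than in their original order.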