Subset rows by a column variable created on the fly with data.table

Asked Jul 13 '15 at 19:39

I'm subsetting the rows of a data.table based on how often a value occurs in a column: a row is kept only if its value in that column occurs more than a threshold number of times.

The code below works. I like how it's all calculated on the fly without an intermediate assignment cluttering the namespace, but nesting one dt[...] call inside another seems un-data.table-like. Is there a more elegant or more efficient way to do this?

Perhaps an approach where the function (.N, table, length or otherwise) goes in the j argument and the row subsetting based on those values goes in the i argument of a single data.table call?

reference for the example

library('data.table')
dt <- data.table(mtcars)
print(table(dt[,cyl]))
#  4  6  8 
# 11  7 14

data.table code

Keep only the rows whose cyl value occurs more than 10 times (4- and 8-cylinder cars in this case):

library('data.table')
dt <- data.table(mtcars)
dt[cyl %in% dt[,.N, by=cyl][N>10, cyl],]
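
A quick sanity check of the filter above, sketched with the same objects. The name `kept` is introduced here only for illustration; the per-group counts should match the 4 and 8 columns of the reference table.

```r
library(data.table)

dt <- data.table(mtcars)

# Same nested lookup as above: find cyl values with more than 10 rows,
# then keep only the rows whose cyl is among them
kept <- dt[cyl %in% dt[, .N, by = cyl][N > 10, cyl]]

# Count the surviving rows per cyl value; only cyl 4 (11 rows)
# and cyl 8 (14 rows) should remain
print(kept[, .N, keyby = cyl])
```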

edited Jul 13 '15 at 19:42

Frank

asked Jul 13 '15 at 19:39

ajb

You can do `dt[, if (.N>10) .SD, by=cyl]` or `dt[dt[,.I[.N>10],by=cyl]$V1]`. There's a feature request open for a `having` argument that might make this look cleaner: https://github.com/Rdatatable/data.table/issues/788 – Frank Jul 13 '15 at 19:45
Exactly what I was looking for. Thanks Frank. – ajb Jul 13 '15 at 19:53
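
For comparison, here is a minimal sketch that runs the nested lookup from the question next to the two forms suggested in the comment above and checks that all three keep the same rows. The names `nested`, `by_sd`, `by_idx` and the helper `normalize` are hypothetical, introduced only for this illustration.

```r
library(data.table)

dt <- data.table(mtcars)

# Approach from the question: nested lookup of the frequent cyl values
nested <- dt[cyl %in% dt[, .N, by = cyl][N > 10, cyl]]

# First suggestion: return .SD only for groups with more than 10 rows
by_sd <- dt[, if (.N > 10) .SD, by = cyl]

# Second suggestion: collect the row numbers (.I) of qualifying groups,
# then use them to subset dt
by_idx <- dt[dt[, .I[.N > 10], by = cyl]$V1]

# Hypothetical helper: put columns and rows into a common order so the
# three results can be compared directly
normalize <- function(x) {
  x <- copy(x)
  setcolorder(x, names(dt))
  setorderv(x, names(dt))
  x
}

same_sd  <- isTRUE(all.equal(normalize(nested), normalize(by_sd),
                             check.attributes = FALSE))
same_idx <- isTRUE(all.equal(normalize(nested), normalize(by_idx),
                             check.attributes = FALSE))
print(c(same_sd, same_idx))
```

The results differ in layout, which is why the comparison normalizes them first: the nested lookup keeps the original row order, while both grouped forms return rows group by group, and the `.SD` form also moves cyl to the first column.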
