I'm using data.table to subset the rows of my data.table based on the frequency of a variable: I only want to keep rows whose value in a given column occurs more than a threshold number of times.

The code below works. I like that everything is calculated on the fly, without an intermediate assignment cluttering the namespace, but nesting one dt call inside another seems un-data.table-like. Is there a more elegant or more efficient way to do this?

Perhaps an approach where the function (.N, table, length or otherwise) goes in the j argument and the row subsetting based on these values goes in the i argument of a single data.table call?

reference for the example

library(data.table)
dt <- data.table(mtcars)
print(table(dt[,cyl]))
#  4  6  8 
# 11  7 14 

data.table code

Keeping only the rows whose cyl value occurs more than 10 times (the 4- and 8-cylinder cars in this case):

library('data.table')
dt <- data.table(mtcars)
dt[cyl %in% dt[,.N, by=cyl][N>10, cyl],]
    You can do `dt[, if (.N>10) .SD, by=cyl]` or `dt[dt[,.I[.N>10],by=cyl]$V1]`. There's a feature request open for a `having` argument that might make this look cleaner: https://github.com/Rdatatable/data.table/issues/788 – Frank Jul 13 '15 at 19:45
    Exactly what I was looking for. Thanks Frank. – ajb Jul 13 '15 at 19:53
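
For anyone comparing the options, the two expressions from Frank's comment can be run side by side with the nested call from the question. This is a minimal sketch on the same `mtcars` example; the result names `res_nested`, `res_sd` and `res_idx` are illustrative and do not come from the thread.

```r
library(data.table)

dt <- data.table(mtcars)

# Approach from the question: look up the qualifying cyl values in a
# nested call, then filter on them in i.
res_nested <- dt[cyl %in% dt[, .N, by = cyl][N > 10, cyl]]

# First suggestion from the comment: per group, return .SD only when the
# group is large enough; groups that return NULL are dropped.
res_sd <- dt[, if (.N > 10) .SD, by = cyl]

# Second suggestion from the comment: collect the row numbers (.I) of the
# qualifying groups, then subset the original table once.
res_idx <- dt[dt[, .I[.N > 10], by = cyl]$V1]

# All three keep the same 25 rows (11 four-cylinder + 14 eight-cylinder).
print(c(nrow(res_nested), nrow(res_sd), nrow(res_idx)))
```

The results are not identical in layout. The nested version keeps the original row and column order, the `.SD` version moves `cyl` to the first column and groups the rows by `cyl`, and the `.I` version keeps the column order but returns rows group by group. Which one is fastest likely depends on the size and number of groups, so it is worth benchmarking on your own data.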