R's data.table package offers fast subsetting of values based on keys.

So, for example:

library(data.table)

set.seed(1342)

df1 <- data.table(group = gl(10, 10, labels = letters[1:10]),
                  value = sample(1:100))
setkey(df1, group)

df1["a"]

will return all rows in df1 where group == "a".

What if I want all rows in df1 where group != "a"? Is there a concise syntax for that using data.table?

Matt Dowle
Erik Iverson

2 Answers

I think you answered your own question:

> nrow(df1[group != "a"])
[1] 90
> table(df1[group != "a", group])

 a  b  c  d  e  f  g  h  i  j 
 0 10 10 10 10 10 10 10 10 10 

Seems pretty concise to me?

EDIT FROM MATTHEW: As per the comments, this is a vector scan. There is a not-join idiom here and here, and feature request #1384 to make it easier.

EDIT: feature request #1384 is implemented in data.table 1.8.3

df1[!'a']

# and to avoid the character-to-factor coercion warning in this example (where
# the key column happens to be a factor) :
df1[!J(factor('a'))]
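
For anyone who wants to check that the not-join returns the same rows as the `!=` filter, here is a minimal sketch. It assumes data.table 1.8.3 or later, and it uses a character key column rather than the factor from the question (my assumption, made only to sidestep the coercion warning mentioned above). The names `dt`, `scan_result` and `notjoin_result` are illustrative.

```r
library(data.table)

set.seed(1342)
# Same shape as the question's data, but with a character key column
# (an assumption for this sketch; the question used a factor).
dt <- data.table(group = rep(letters[1:10], each = 10),
                 value = sample(1:100))
setkey(dt, group)

scan_result    <- dt[group != "a"]  # logical filter evaluated over the whole column
notjoin_result <- dt[!"a"]          # not-join against the key

nrow(notjoin_result)
identical(scan_result$value, notjoin_result$value)
```

Both expressions should keep the 90 rows whose group is not "a"; the difference is only in how the rows are located.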
Chase
  • But `!=` is like `==`, i.e. _vector scans_. Instead, there is a _not join_ idiom in [this question](http://stackoverflow.com/questions/7920688/non-joins-with-data-tables) and [this question](http://stackoverflow.com/questions/7822138/porting-set-operations-from-rs-data-frames-to-data-tables-how-to-identify-dupl). Those link to a feature request to make not-join syntax even easier. In this case it would be `df1[-"a"]`. The not-join idiom should be faster than vector scanning. – Matt Dowle Apr 10 '12 at 10:52
  • Yes, Matt, as I suspected, the above solutions do use vector scans, which I'd love to avoid if possible. I still did note a speed increase compared to similarly sized data.frames in my comment below, but I'll have to investigate why that is. In the meantime, you gave me the right search terms and alternative formulations of the question, and you clearly understand what I'm after. Thanks for all your hard work on this great package. – Erik Iverson Apr 10 '12 at 21:09

I would just get all keys that are not "a":

df1[!(group %in% "a")]

Does this achieve what you want?

Christoph_J
  • Or, alternatively, `df1[group != "a"]`. What I'd be curious to know is whether there are important speed differences between our two expressions and: `df1[setdiff(unique(df1$group), "a")]` or `df1[letters[2:10]]`. – Josh O'Brien Apr 03 '12 at 15:55
  • @JoshO'Brien Yep, I definitely took the complicated road here ;-) So I would go with Chase's or your solution as well. – Christoph_J Apr 03 '12 at 15:58
  • Thank you all, I appreciate the help. All solutions listed thus far are very comparable with the size of my data, and about twice as fast as the similar techniques with data.frames. – Erik Iverson Apr 03 '12 at 16:14
  • @Josh Re speed differences, see not-join idiom in links in my comment to accepted answer, to add to the mix. A not-join _should_ be fastest, but whether it is or not I'm not sure (could need some follow up and fixes). – Matt Dowle Apr 10 '12 at 11:02
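
To follow up on the speed question raised in these comments, here is a rough timing sketch rather than a proper benchmark. The table size, the character key column and the names `big`, `other_keys`, `r_scan`, `r_in`, `r_notjoin` and `r_keys` are my own assumptions for illustration. Absolute timings will depend on the machine and the data.table version, so no ranking is claimed here.

```r
library(data.table)

set.seed(1)
n <- 1e6  # arbitrary size, chosen only for illustration
big <- data.table(group = sample(letters, n, replace = TRUE),
                  value = runif(n))
setkey(big, group)

# Keys to keep, computed up front as in Josh O'Brien's suggestion.
other_keys <- setdiff(unique(big$group), "a")

# Time each formulation once; repeat the runs for more stable numbers.
t_scan    <- system.time(r_scan    <- big[group != "a"])
t_in      <- system.time(r_in      <- big[!(group %in% "a")])
t_notjoin <- system.time(r_notjoin <- big[!"a"])
t_keys    <- system.time(r_keys    <- big[other_keys])

# Elapsed seconds per approach.
print(c(scan    = t_scan[["elapsed"]],
        `in`    = t_in[["elapsed"]],
        notjoin = t_notjoin[["elapsed"]],
        keys    = t_keys[["elapsed"]]))
```

All four results should contain the same number of rows, so the comparison is only about how long each lookup takes.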