6

If I have a dataframe like this:

neu <- data.frame(test1 = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14), 
                  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f"))
neu
   test1 test2
1      1     a
2      2     b
3      3     a
4      4     b
5      5     c
6      6     c
7      7     a
8      8     c
9      9     c
10    10     d
11    11     d
12    12     f
13    13     f
14    14     f

and I would like to select only those rows where the level of the factor test2 appears more than, let's say, three times, what would be the fastest way?

Thanks very much. I didn't really find the right answer in the previous questions.

Arun

4 Answers

7

Find the rows using:

counts <- table(neu$test2) # count each level once
z <- counts[counts >= 3] # levels appearing 3 or more times; use > 3 for "more than three"

Or:

z <- which(table(neu$test2) >= 3) # named vector, so names(z) below gives the matching levels

Then subset with:

subset(neu, test2 %in% names(z))

Or:

neu[neu$test2 %in% names(z),]
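
Note that `>= 3` also keeps levels that appear exactly three times, whereas the question asks for "more than three". A minimal sketch of the difference, rebuilding `neu` from the question (with `1:14` as shorthand for the same values); `counts`, `rows_gt3` and `rows_ge3` are names introduced only for this illustration:

```r
neu <- data.frame(test1 = 1:14,
                  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f"))

counts <- table(neu$test2)  # a = 3, b = 2, c = 4, d = 2, f = 3

# Strictly more than three occurrences: only "c" qualifies
rows_gt3 <- neu[neu$test2 %in% names(counts)[counts > 3], ]

# Three or more occurrences: "a", "c" and "f" qualify
rows_ge3 <- neu[neu$test2 %in% names(counts)[counts >= 3], ]

rows_gt3$test1  # 5 6 8 9
rows_ge3$test1  # 1 3 5 6 7 8 9 12 13 14
```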
Thomas
  • Why use `as.list`? Why two `table(.)`? And it's better not to use `subset`. – Arun May 16 '13 at 11:58
  • See alternative strategies above. – Thomas May 16 '13 at 12:48
  • @Arun, do you mind explaining a little, or linking to a reason for your suggestion not to use `subset`? The thing about `subset` is that it is intuitively named, so it's more natural to think "I'm going to subset that" rather than, say, "I'm going to use `%in%`" or "I'm going to `which` it" ... – PatrickT Dec 09 '14 at 12:08
  • @PatrickT `subset` uses non-standard evaluation, so it can produce unexpected results. For example, if you use it inside a function, it will typically not work right or at all. Best advice is to use `[` for all extraction. – Thomas Dec 09 '14 at 12:50
  • @PatrickT, at the time I was probably influenced by [this post](http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset). While there are some valid points there, I don't see a reason not to use it, as long as you know what you're doing. I tend to avoid suggesting "never use this/that" these days. – Arun Dec 09 '14 at 12:51
  • Thanks @Thomas, will try to remember that. – PatrickT Dec 09 '14 at 13:05
  • Thanks for the link @Arun. Sounds scary! Good question/answer you linked to. – PatrickT Dec 09 '14 at 13:05
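
To illustrate the point about `[` being the safer choice inside functions, here is a minimal sketch of a helper that takes the column name as a string. `keep_frequent`, `col` and `min_count` are hypothetical names chosen for this example and do not come from any package:

```r
# Keep the rows of df whose value in column `col` occurs more than
# `min_count` times. Uses `[[` and `[` only, so it behaves the same
# inside or outside a function.
keep_frequent <- function(df, col, min_count) {
  counts <- table(df[[col]])
  frequent <- names(counts)[counts > min_count]
  df[df[[col]] %in% frequent, , drop = FALSE]
}

neu <- data.frame(test1 = 1:14,
                  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f"))

res <- keep_frequent(neu, "test2", 3)
res$test1  # 5 6 8 9
```

Passing the column as a string avoids relying on how an unquoted name would be evaluated, which is the concern raised about `subset` above.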
5

Here's another way:

 with(neu, neu[ave(seq_along(test2), test2, FUN = length) > 3, ])

#   test1 test2
# 5     5     c
# 6     6     c
# 8     8     c
# 9     9     c
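
To see why this works, it may help to look at the intermediate vector: `ave()` with `FUN = length` replaces each element with the size of its group, so every row carries the count of its own `test2` value. A small sketch (the name `group_size` is introduced here only for illustration):

```r
neu <- data.frame(test1 = 1:14,
                  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f"))

# One entry per row: how many rows share that row's test2 value
group_size <- ave(seq_along(neu$test2), neu$test2, FUN = length)
group_size
# 3 2 3 2 4 4 3 4 4 2 2 3 3 3

# Rows whose group has more than three members
kept <- neu[group_size > 3, ]
kept$test1  # 5 6 8 9
```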
Matthew Plourde
3

I'd use count from the plyr package to perform the counting:

library(plyr)
count_result = count(neu, "test2")
matching = with(count_result, test2[freq > 3])
with(neu, test1[test2 %in% matching]) # returns the matching test1 values, not whole rows
# [1] 5 6 8 9
Paul Hiemstra
2

The (better scaling) data.table way:

library(data.table)
dt = data.table(neu)

dt[dt[, .I[.N >= 3], by = test2]$V1]

Note: hopefully, in the future, the following simpler syntax will be the fast way of doing this:

dt[, .SD[.N >= 3], by = test2]

(cf. the question "Subset by group with data.table")
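
For readers unfamiliar with the `.I` idiom, here is a sketch of what the inner call produces, assuming the data.table package is installed. `idx` is a name introduced for this example, and the table is rebuilt directly rather than converted from `neu`:

```r
library(data.table)

dt <- data.table(test1 = 1:14,
                 test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f"))

# .I holds the row numbers of the current group within dt; .N is the group size.
# For groups that fail the condition, .I[FALSE] is empty, so they contribute nothing.
idx <- dt[, .I[.N >= 3], by = test2]$V1
idx  # 1 3 7 5 6 8 9 12 13 14 (groups in order of first appearance)

# sort() restores the original row order before subsetting
dt[sort(idx)]
```

Because the row numbers come back group by group, `dt[idx]` without `sort()` returns the rows ordered by group rather than in their original order.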

eddi