Subsetting a dataframe by the amount of repetition

Question

If I have a dataframe like this:

neu <- data.frame(test1 = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14), 
                  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f"))
neu
   test1 test2
1      1     a
2      2     b
3      3     a
4      4     b
5      5     c
6      6     c
7      7     a
8      8     c
9      9     c
10    10     d
11    11     d
12    12     f
13    13     f
14    14     f

and I would like to select only those rows where the level of the factor test2 appears more than, say, three times, what would be the fastest way?

Thanks very much, didn't really find the right answer in the previous questions.

Thomas · Accepted Answer · 2013-05-16T12:44:52.110

7

Find the levels that repeat often enough using:

z <- table(neu$test2)[table(neu$test2) >= 3] # repeats greater than or equal to 3 times

Or:

z <- which(table(neu$test2) >= 3) # named vector, so names(z) below still works

Then subset with:

subset(neu, test2 %in% names(z))

Or:

neu[neu$test2 %in% names(z),]
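
Putting the two steps above together, a minimal self-contained sketch might look like the following. The helper name `keep_frequent` and its `min_n` argument are illustrative assumptions, not part of the original answer; the sketch calls `table()` only once and extracts with `[`.

```r
# Sample data from the question (test1 written as 1:14 for brevity)
neu <- data.frame(
  test1 = 1:14,
  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f")
)

# Hypothetical helper: keep rows whose value in `col` occurs at least `min_n` times
keep_frequent <- function(df, col, min_n) {
  counts <- table(df[[col]])                  # count every level once
  frequent <- names(counts)[counts >= min_n]  # levels that meet the threshold
  df[df[[col]] %in% frequent, , drop = FALSE] # plain `[` extraction
}

at_least_3  <- keep_frequent(neu, "test2", 3) # levels a, c and f
more_than_3 <- keep_frequent(neu, "test2", 4) # only level c
```

Note that "more than three times" in the question corresponds to `min_n = 4` here, while the answer's `>= 3` corresponds to `min_n = 3`.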

edited May 16 '13 at 12:44

answered May 16 '13 at 11:49

Thomas

Why use `as.list`? Why two `table(.)`? And it's better not to use `subset`. – Arun May 16 '13 at 11:58
See alternative strategies above. – Thomas May 16 '13 at 12:48
@Arun, do you mind explaining a little, or linking to a reason for your suggestion not to use ``subset``? The thing about ``subset`` is that it is intuitively named, so it's more natural to think "I'm going to subset that" rather than, say, "I'm going to use ``%in%``" or "I'm going to ``which`` it" ... – PatrickT Dec 09 '14 at 12:08
@PatrickT `subset` uses non-standard evaluation, so it can produce unexpected results. For example, if you use it inside a function, it will typically not work right or at all. Best advice is to use `[` for all extraction. – Thomas Dec 09 '14 at 12:50
@PatrickT, at the time I was probably influenced by [this post](http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset). While there are some valid points there, I don't see a reason not to use it, as long as you know what you're doing. I tend to avoid suggesting "never use this/that" these days. – Arun Dec 09 '14 at 12:51
Thanks @Thomas, will try to remember that. – PatrickT Dec 09 '14 at 13:05
Thanks for the link @Arun. Sounds scary! Good question/answer you linked to. – PatrickT Dec 09 '14 at 13:05
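
The non-standard-evaluation concern raised in the comments above can be sketched roughly as follows. The wrapper names `filter_subset` and `filter_bracket` are made up for illustration, and the sketch assumes no object called `test1` exists in the calling environment (if one did, `subset` would silently use it instead of failing).

```r
neu <- data.frame(
  test1 = 1:14,
  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f")
)

# Hypothetical wrappers, for illustration only
filter_subset  <- function(df, condition) subset(df, condition)
filter_bracket <- function(df, keep) df[keep, , drop = FALSE]

# Inside the wrapper, subset() sees only the symbol `condition`; when that is
# evaluated, `test1` is looked up outside the data frame and the call fails.
res_subset <- tryCatch(
  filter_subset(neu, test1 > 10),
  error = function(e) "error"
)

# With `[`, the logical vector is computed by the caller, so nothing is ambiguous.
res_bracket <- filter_bracket(neu, neu$test1 > 10)
```

Used interactively at the top level, `subset(neu, test1 > 10)` works as expected; the surprise only shows up once the call is wrapped in another function.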

score 5 · Answer 2 · answered May 16 '13 at 11:52

Here's another way:

 with(neu, neu[ave(seq(test2), test2, FUN=length) > 3, ])

#   test1 test2
# 5     5     c
# 6     6     c
# 8     8     c
# 9     9     c
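
To see why the one-liner above works, it may help to look at the intermediate vector that `ave` produces. This is a sketch using `seq_along` in place of `seq` and an illustrative variable name `n_per_row`; neither is from the original answer.

```r
neu <- data.frame(
  test1 = 1:14,
  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f")
)

# ave() applies FUN within each group of test2 and returns a vector aligned
# with the rows, so every row carries the size of its own group.
n_per_row <- ave(seq_along(neu$test2), neu$test2, FUN = length)
# one count per row: 3 2 3 2 4 4 3 4 4 2 2 3 3 3

# Rows whose level occurs more than three times (only level c here)
res <- neu[n_per_row > 3, ]
```

Because the counts line up with the rows, the result keeps the original row order without any lookup by level name.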

answered May 16 '13 at 11:52

Matthew Plourde

+1 this is by far the best base solution to me. – Arun May 17 '13 at 09:36

score 3 · Answer 3 · answered May 16 '13 at 11:50

I'd use count from the plyr package to perform the counting:

library(plyr)
count_result = count(neu, "test2")
matching = with(count_result, test2[freq > 3])
with(neu, test1[test2 %in% matching])
[1] 5 6 8 9

answered May 16 '13 at 11:50

Paul Hiemstra

score 2 · Answer 4 · edited May 23 '17 at 12:21

The (better scaling) data.table way:

library(data.table)
dt = data.table(neu)

dt[dt[, .I[.N >= 3], by = test2]$V1]

Note: hopefully, in the future, the following simpler syntax will be the fast way of doing this:

dt[, .SD[.N >= 3], by = test2]

(cf. Subset by group with data.table)
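
As a rough sketch of what the `.I` idiom above does, the inner call can be run on its own. This assumes the data.table package is installed; the variable names `idx` and `res` are illustrative.

```r
library(data.table)

neu <- data.frame(
  test1 = 1:14,
  test2 = c("a","b","a","b","c","c","a","c","c","d","d","f","f","f")
)
dt <- as.data.table(neu)

# For each group, .N is the group size and .I holds the row numbers in dt.
# Groups with fewer than 3 rows contribute nothing; the row numbers of the
# remaining groups end up in the automatically named column V1.
idx <- dt[, .I[.N >= 3], by = test2]

# Use those row numbers to pull the matching rows from dt
res <- dt[idx$V1]
```

One thing to be aware of: `idx$V1` is ordered by group (a, then c, then f), so `res` comes back grouped rather than in the original row order; `dt[sort(idx$V1)]` should restore the original order if that matters.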

edited May 23 '17 at 12:21

Community

answered May 16 '13 at 14:47

eddi
