subset based on frequency level

Question

I want to build a data frame containing only the rows of df1 whose "ID" value occurs more often than a threshold I call cutoff. For this example I set cutoff to 9, so I want the rows of df1 whose ID value is shared by more than 9 rows. The last line of my code generates a df that I don't understand. The correct df would have 24 rows, all with either a 4 or a 5 in the ID column. Can someone explain what my last line of code is actually doing and suggest a different approach?

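library(plyr)  # provides count(); it is not a base R function

set.seed(123)
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
sub1 <- rnorm(45)
sub2 <- rnorm(45)
df1 <- data.frame(ID, sub1, sub2)
IDfreq <- count(df1, "ID")  # one row per ID, with a freq column
cutoff <- 9
# The line in question: it does not return the expected 24 rows
df2 <- subset(df1, subset = (IDfreq$freq > cutoff))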

`count` is not a base R function. Your subset argument inside subset() should not be using `$` either. — IRTFM, Jul 18 '14 at 23:52
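As for what the last line is actually doing: a minimal sketch, assuming base R only and using `table()` as a stand-in for `plyr::count()` (the `freq` and `keep` names are illustrative, not from the original code). The condition has one element per distinct ID (5 values), not one per row (45 values), so `subset()` recycles it down the rows and the selection no longer depends on each row's ID.

```r
set.seed(123)
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID, sub1 = rnorm(45), sub2 = rnorm(45))
cutoff <- 9

# Stand-in for IDfreq$freq: one count per distinct ID (5 values, not 45)
freq <- as.vector(table(df1$ID))
keep <- freq > cutoff   # FALSE FALSE FALSE TRUE TRUE

# The 5-element logical vector is recycled over all 45 rows, so the
# 4th and 5th row of every block of five is kept, whatever its ID
df2 <- subset(df1, subset = keep)
nrow(df2)
table(df2$ID)
```

The result has 18 rows spread over all five IDs, rather than the 24 rows for the two most frequent IDs.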

score 6 · Accepted Answer · answered Jul 18 '14 at 23:48

df1[df1$ID %in% names(table(df1$ID))[table(df1$ID) > 9], ]

This tests whether each df1$ID value belongs to a category with more than 9 rows. Where it does, the corresponding element of the resulting logical vector is TRUE, and passing that vector as the "i" argument makes the `[` function return the entire row, since the "j" argument is empty.

See:

?`[`
?'%in%'
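To see the role of `table` and `names`, the one-liner above can be unpacked into steps. This is only a sketch; the intermediate names (`counts`, `big`, `keep_ids`, `rows`) are made up for illustration.

```r
set.seed(123)
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID, sub1 = rnorm(45), sub2 = rnorm(45))

counts <- table(df1$ID)          # number of rows for each distinct ID
big <- counts > 9                # logical vector, one element per distinct ID
keep_ids <- names(counts)[big]   # labels of the IDs that pass, as character
rows <- df1$ID %in% keep_ids     # one TRUE/FALSE per row of df1
result <- df1[rows, ]            # empty "j" argument: keep all columns

keep_ids
nrow(result)
```

`names(counts)` is needed because the table stores the ID values as labels and the counts as values; `%in%` then compares each row's ID against those labels.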

IRTFM

Thanks, this works perfectly. I checked out the help documentation you recommended, but I'm not grasping it completely. Confused about role of names and table. Can you suggest a good online resource more suited to beginners? – user3614783 Jul 19 '14 at 00:51
I'm not sure where the confusion arises. If you have a one-dimensional table, then the value of its "names" attribute is likewise one-dimensional, i.e. an R vector. So the task is choosing which one (or more) of those character values is associated with counts that meet the criterion you set up in the logical test: `table(df1$ID) >9`. That is the work done by the expression to the right of the `%in%`. – IRTFM May 09 '21 at 20:17

score 6 · Answer 2 · answered Jul 19 '14 at 06:47

Using dplyr:

library(dplyr)
df1 %>%
  group_by(ID) %>%
  filter(n() > cutoff)

akrun

score 5 · Answer 3 · answered Jul 18 '14 at 23:57

Maybe closer to what you had in mind is to create a vector of frequencies using ave:

subset(df1, ave(ID, ID, FUN = length) > cutoff)

flodel

I know the +1 comments are discouraged, but this one is too elegant to resist. – IRTFM Jul 19 '14 at 00:33
Thank you, this works too and I like how compact it is. But can you explain how each of the two ID arguments work together? And how does ave act on what follows it? The R documentation says that this is averaging subsets of observations with the same factor level. Why averaging? – user3614783 Jul 19 '14 at 02:40
The first ID is just a marker used to count how many items there are in each factor level of ID. – IRTFM Jul 19 '14 at 03:46
`ave(arg1, arg2, FUN = length)` is essentially bucketing your data using `arg2`, then computing `length(arg1)` for each bucket, finally putting the results all back into one vector. So essentially `ave(ID, ID, FUN = length)` is giving you, for each row, the number of times its `ID` appears in the whole `ID` column. To help you understand, you might do it in two steps: `df1 <- transform(df1, freq = ave(ID, ID, FUN = length)); print(df1); subset(df1, freq > cutoff);` – flodel Jul 19 '14 at 11:05
This is helpful. I'm new to R and never would have thought of looking at this. The documentation says that ave groups averages. – user3614783 Jul 19 '14 at 18:48
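The two-step version flodel describes can be sketched as follows, rebuilding the question's data so that it runs on its own. On the "why averaging?" point: `ave` uses `FUN = mean` by default, which is where the name comes from, but any function can be applied per group, and `length` turns it into a per-row count.

```r
set.seed(123)
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID, sub1 = rnorm(45), sub2 = rnorm(45))
cutoff <- 9

# Step 1: for every row, store how many rows share its ID
df1 <- transform(df1, freq = ave(ID, ID, FUN = length))
head(df1$freq, 6)   # five rows of ID 1, then the first row of ID 2

# Step 2: filter on the per-row frequency
df2 <- subset(df1, freq > cutoff)
nrow(df2)
```

Because `freq` has one value per row, the condition lines up with the rows of `df1`, which is what the question's original `subset()` call was missing.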
