Filter dataset based on occurrence

Question

I have a large dataset (more than 500,000 rows) that I want to filter in R. I only want to keep the most relevant information, so I thought it would be a good idea to keep only the rows whose elements occur more than some number of times. For example, I have this data:

And I want to retain the rows whose value in column A occurs more than once. In this case the output would be:

I know how to do it with for loops and rbind, but since the dataset I am using is very big, performance suffers badly. Any advice?

akrun · Accepted Answer · 2015-10-09T12:19:58.537

We can do this using data.table, dplyr, or base R methods. With data.table, we convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'A', and, if the number of rows in a group (.N) is greater than 1, return the Subset of Data.table (.SD).

library(data.table)
setDT(df1)[, if(.N>1) .SD, by = A]

Or we use dplyr. We group by 'A' and filter to keep the groups that have more than one row (n() > 1).

library(dplyr)
df1 %>%
   group_by(A) %>%
   filter(n()>1)

Or, using ave from base R, we compute the size of each row's group, compare it to 1 to get a logical index, and use that to subset the dataset.

df1[with(df1, ave(seq_along(A), A, FUN = length)) > 1, ]
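
To see what the ave() index looks like, here is a minimal sketch on a made-up data frame. The example data from the question is not shown, so this df1 and its columns A and B are assumptions for illustration only. ave() returns, for each row, the size of the group that row belongs to, and comparing that vector to 1 gives the logical index.

```r
# Hypothetical sample data; the values and column B are assumptions
df1 <- data.frame(A = c("a", "b", "a", "c", "b", "a"),
                  B = 1:6,
                  stringsAsFactors = FALSE)

# For every row, the number of rows that share its value of A
group_size <- with(df1, ave(seq_along(A), A, FUN = length))
group_size
# 3 2 3 1 2 3

# Keep only the rows whose A value occurs more than once
res_ave <- df1[group_size > 1, ]
res_ave
```

Here the single "c" row is dropped and the other five rows are kept in their original order.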

Or, without using any grouping, we can call duplicated in both directions (forward and with fromLast=TRUE) to flag every row whose 'A' value is repeated, and subset with that index

df1[duplicated(df1$A)|duplicated(df1$A, fromLast=TRUE),]
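
The duplicated() approach only covers the "more than once" case. For an arbitrary threshold, one base R option is to count values with table() and look up each row's count by name. This is not part of the answer above, just a sketch: the sample data and the name min_count are hypothetical.

```r
# Hypothetical sample data; the values and column B are assumptions
df1 <- data.frame(A = c("a", "b", "a", "c", "b", "a"),
                  B = 1:6,
                  stringsAsFactors = FALSE)

# duplicated() flags the second and later occurrences of a value;
# with fromLast = TRUE it flags all but the last occurrence.
# Their union therefore marks every row of a repeated value.
dup_idx <- duplicated(df1$A) | duplicated(df1$A, fromLast = TRUE)
res_dup <- df1[dup_idx, ]

# Generalisation (assumption, not from the answer): keep rows whose
# A value occurs more than min_count times
min_count <- 2
counts <- table(df1$A)
row_counts <- as.vector(counts[as.character(df1$A)])
res_tab <- df1[row_counts > min_count, ]
res_tab
```

With min_count set to 2, only the three "a" rows remain. Note that table() drops NA values by default, so rows with a missing A would need separate handling.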
