
I have a large dataset (more than 500,000 rows) that I want to filter in R. I only want to keep the most relevant information, so I thought it would be a good idea to keep only the rows whose value occurs more than some number of times. For example, I have this data:

A     B
2     5
4     7
2     8
3     7
2     9
4     2
1     0
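
For a reproducible version, the table above can be written as a data frame. The name `df1` is only a placeholder chosen for illustration, and the columns are assumed to be numeric.

```r
# Sample data from the table above; `df1` is a placeholder name
df1 <- data.frame(
  A = c(2, 4, 2, 3, 2, 4, 1),
  B = c(5, 7, 8, 7, 9, 2, 0)
)

# Count how often each value of A occurs
table(df1$A)
```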

I want to keep the rows whose value in column A occurs more than once. In this case the output will be:

    A     B
    2     5
    4     7
    2     8
    2     9
    4     2

I know how to do it with for loops and rbind, but since my dataset is very large, that approach is far too slow. Any advice?

zx8754
John

1 Answer


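We can do this with data.table, dplyr, or base R. With data.table, we convert the 'data.frame' to a 'data.table' by reference (setDT(df1)), group by 'A', and return the Subset of Data.table (.SD) for each group only if its number of rows (.N) is greater than 1. The rows of each group come back together, so the result is not in the original row order.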

library(data.table)
setDT(df1)[, if(.N>1) .SD, by = A]

Or we use dplyr: group by 'A' and filter to keep only the groups with more than one row (n() > 1).

library(dplyr)
df1 %>%
   group_by(A) %>%
   filter(n()>1)

Or, using ave from base R, we compute the group size for every row, compare it with 1 to get a logical index, and use that index to subset the dataset.

df1[with(df1, ave(seq_along(A), A, FUN = length)) > 1, ]
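
Since the question asks about an occurrence "greater than some value", the ave approach can be sketched with an arbitrary threshold. The names `min_count`, `group_size` and `filtered` are hypothetical, and the data frame is rebuilt from the question's example so the snippet runs on its own.

```r
# Sample data from the question
df1 <- data.frame(
  A = c(2, 4, 2, 3, 2, 4, 1),
  B = c(5, 7, 8, 7, 9, 2, 0)
)

# Hypothetical threshold: keep values of A that occur more than this many times
min_count <- 2

# For every row, the number of rows sharing its value of A
group_size <- ave(seq_along(df1$A), df1$A, FUN = length)

# Keep rows whose group is larger than the threshold
filtered <- df1[group_size > min_count, ]
filtered
```

Setting `min_count <- 1` gives the same filter as the one-liner above.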

Or, without any grouping, we can call duplicated in both directions (from the start and from the end) to get the index and subset

df1[duplicated(df1$A)|duplicated(df1$A, fromLast=TRUE),]
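
As a quick sanity check, the two base R indices can be compared on the question's sample data. This is only a sketch: the names `keep_ave`, `keep_dup` and `res` are made up for illustration, and the columns are assumed to be numeric.

```r
# Sample data from the question
df1 <- data.frame(
  A = c(2, 4, 2, 3, 2, 4, 1),
  B = c(5, 7, 8, 7, 9, 2, 0)
)

# Index from per-group counts
keep_ave <- with(df1, ave(seq_along(A), A, FUN = length)) > 1

# Index from duplicates found scanning forwards or backwards
keep_dup <- duplicated(df1$A) | duplicated(df1$A, fromLast = TRUE)

# Both indices should flag the same rows
all(keep_ave == keep_dup)

# Subset in the original row order
res <- df1[keep_dup, ]
res
```

Both base R variants keep the original row order, which may matter if the data is already sorted in a meaningful way.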
akrun