
df:-

Date    Name  Salary 
Q1 2015 ABC   $10
Q2 2015 ABC   $11
Q3 2015 ABC   $15
Q1 2015 XYZ   $25
Q2 2015 XYZ   $20

I want to remove the rows whose Name occurs fewer than 3 times in the data. For example, XYZ has a frequency of 2, so I want to remove rows 4 and 5.

test <- setDT(df)[,.I[.N>2],by=Name]

Output:-

> test
   Name V1
1:  ABC  1
2:  ABC  2
3:  ABC  3

The filtering is done correctly, but I don't get the whole data set back; I only get the Name column in the output.

Meesha

1 Answer


The grouped call returns only the grouping column and the matching row indices (in 'V1'), so we need to extract the 'V1' column and use it as the row index in 'i' to subset the rows.

setDT(df)[df[,.I[.N>2],by=Name]$V1]
#       Date Name Salary
#1: Q1 2015  ABC    $10
#2: Q2 2015  ABC    $11
#3: Q3 2015  ABC    $15
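To see why the extra step is needed, it can help to build the intermediate result separately. The following is only a sketch: the `data.table()` call that recreates the question's data and the names `idx` and `res` are assumptions for illustration, not part of the original code.

```r
library(data.table)

# Recreate the example data from the question (assumed construction)
df <- data.table(
  Date   = c("Q1 2015", "Q2 2015", "Q3 2015", "Q1 2015", "Q2 2015"),
  Name   = c("ABC", "ABC", "ABC", "XYZ", "XYZ"),
  Salary = c("$10", "$11", "$15", "$25", "$20")
)

# Step 1: within each Name group, .I holds that group's row numbers in df.
# They are kept only when the group has more than 2 rows (.N > 2).
idx <- df[, .I[.N > 2], by = Name]$V1

# Step 2: use those row numbers in 'i' so that all columns come back
res <- df[idx]
print(res)
```

Groups that fail the `.N > 2` test contribute no row numbers at all, which is why XYZ drops out of `idx`.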

Or a more concise option with `if` and `.SD`, which returns the group's rows only when the condition holds:

setDT(df)[, if(.N >2) .SD, by = Name]
#    Name    Date Salary
#1:  ABC Q1 2015    $10
#2:  ABC Q2 2015    $11
#3:  ABC Q3 2015    $15

In case a dplyr method is needed:

library(dplyr)
df %>%
   group_by(Name) %>%
   filter(n() >2 )
#      Date  Name Salary
#     <chr> <chr>  <chr>
#1 Q1 2015   ABC    $10
#2 Q2 2015   ABC    $11
#3 Q3 2015   ABC    $15

With base R there are several options. One uses ave:

df[with(df, ave(seq_along(Name), Name, FUN = length)>2),]

or using table

tbl <- table(df$Name)> 2
subset(df, Name %in% names(tbl)[tbl])
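Both base R options rest on the same idea: count the rows per Name, then keep the rows whose count exceeds 2. A minimal self-contained sketch of that, assuming the question's data is rebuilt with `data.frame()`; the names `grp_size`, `counts`, `keep_ave` and `keep_tbl` are hypothetical and used only for illustration:

```r
# Recreate the example data from the question (assumed construction)
df <- data.frame(
  Date   = c("Q1 2015", "Q2 2015", "Q3 2015", "Q1 2015", "Q2 2015"),
  Name   = c("ABC", "ABC", "ABC", "XYZ", "XYZ"),
  Salary = c("$10", "$11", "$15", "$25", "$20"),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the size of that row's Name group
grp_size <- ave(seq_along(df$Name), df$Name, FUN = length)
keep_ave <- df[grp_size > 2, ]

# table() counts each Name once; keep the names whose count exceeds 2
counts   <- table(df$Name)
keep_tbl <- subset(df, Name %in% names(counts)[counts > 2])

print(keep_ave)
print(keep_tbl)
```

The `ave` version works row by row, while the `table` version first reduces to one count per Name and then matches back with `%in%`; on this data both should keep the same three ABC rows.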
akrun
  • I suspect the `.SD` way is efficient now, per the first item in https://github.com/Rdatatable/data.table/issues/735. I think I might be misreading that, though... I'd be curious to see whether that holds or not. – Frank Aug 22 '16 at 17:56
  • @Frank I am using the devel version. Some benchmarks I ran recently favor the `.I` option. – akrun Aug 22 '16 at 17:58