I have seen similar questions, but couldn't apply them to solve my problem. I want a function that drops groups with fewer than 3 observations. Since I need this for several data frames, it has to be a function. I did it with dplyr and merge, but I would like cleaner code that uses only dplyr or data.table.

data <- read.table(text="
col1       col2 
group1   some  
group1   some2     
group1   some3
group2   some  
group2   some2",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)


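library(dplyr)

# Count the rows in each group of column `colgroup1` (given as a string).
filter3 <- function(df, colgroup1) {
  df %>%
    group_by(across(all_of(colgroup1))) %>%
    summarise(Count = n())
}

# Keep the rows whose value in column `col` is at least 3.
great3 <- function(x, col) {
  x[x[[col]] >= 3, ]
}

# Merge the counts back so that only groups with >= 3 rows remain.
allfiltered <- function(df, colgroup1) {
  counts <- filter3(df, colgroup1)
  final <- great3(counts, "Count")
  merge(df, final, by = colgroup1)
}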

allfiltered(data,"col1")

# Expected output (the Count column is optional; a function that works on one data frame or a list of data frames is wanted):
    col1  col2 Count
1 group1  some     3
2 group1 some2     3
3 group1 some3     3
Ferroao

Regarding data.table, see http://stackoverflow.com/q/36869784/. Essentially, `f = function(d, col, n = 3) setDT(d)[, if (.N >= n) .SD, by=col]`, which can be used like `f(data, "col1")`. Also see http://stackoverflow.com/q/39085450/ – Frank Mar 08 '17 at 20:14
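
The data.table one-liner in the comment above can be written out as a small function. This is only a minimal sketch, assuming the data.table package is installed; the name `filter_groups_dt` and the `min_n` argument are hypothetical, and `as.data.table()` is used here instead of `setDT()` so that the caller's data frame is copied rather than converted in place.

```r
library(data.table)

# Example data from the question.
data <- read.table(text = "
col1 col2
group1 some
group1 some2
group1 some3
group2 some
group2 some2", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: keep only the groups of `col` with at least `min_n` rows.
filter_groups_dt <- function(d, col, min_n = 3) {
  dt <- as.data.table(d)  # works on a copy, so `d` itself is not changed
  # .N is the number of rows in the current group; .SD holds that group's rows.
  dt[, if (.N >= min_n) .SD, by = col]
}

res <- filter_groups_dt(data, "col1")
print(res)
```

For the example data, only the three `group1` rows should be returned, and `data` stays an ordinary data.frame.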

2 Answers

You can just use group_by() %>% filter(); see ?n for related examples:

data %>% group_by(col1) %>% filter(n() >= 3)

#Source: local data frame [3 x 2]
#Groups: col1 [1]

#    col1  col2
#   <chr> <chr>
#1 group1  some
#2 group1 some2
#3 group1 some3

To wrap this in a function:

# colgroup1 is the grouping column name as a string; across(all_of()) needs dplyr >= 1.0
allfiltered <- function(data, colgroup1) {
    data %>%
        group_by(across(all_of(colgroup1))) %>%
        filter(n() >= 3)
}
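
Since the question asks for something that works on one data frame or a list of them, a wrapper like the one above can be applied over a list with lapply(). This is a minimal sketch, assuming dplyr 1.0 or later; `filter_groups`, the `min_n` argument, and the example list `dfs` are hypothetical names used for illustration only.

```r
library(dplyr)

# Example data from the question.
data <- read.table(text = "
col1 col2
group1 some
group1 some2
group1 some3
group2 some
group2 some2", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical variant of the wrapper with a configurable threshold.
filter_groups <- function(df, colgroup1, min_n = 3) {
  df %>%
    group_by(across(all_of(colgroup1))) %>%
    filter(n() >= min_n) %>%
    ungroup()
}

# Hypothetical list of data frames that share the grouping column.
dfs <- list(a = data, b = data[data$col1 == "group2", ])

# Apply the same filter to every data frame in the list.
results <- lapply(dfs, filter_groups, colgroup1 = "col1")
sapply(results, nrow)  # number of rows kept in each data frame
```

Here `a` keeps its three `group1` rows, while `b` contains only the two-row `group2` and so comes back empty.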
Psidom

In base R we can use split, Filter and rbind:

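allfiltered <- function(df, colGroup) {

    # Split into a list with one data.frame per group.
    d <- split(df, as.factor(df[[colGroup]]))

    # Keep only the groups with at least 3 rows.
    l <- Filter(function(x) nrow(x) >= 3, d)

    # Bind the remaining groups back together and reset the row names.
    out <- do.call(rbind, l)
    rownames(out) <- NULL
    out
}

This splits the data.frame into a list of data.frames, keeps the elements that satisfy the condition, and finally binds the list back into a single data.frame:

allfiltered(data, 'col1')
#     col1  col2
# 1 group1  some
# 2 group1 some2
# 3 group1 some3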
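
To see what each step of the base R approach contributes, the intermediate objects can be inspected directly. This is a small sketch on the question's example data; the variable names `pieces`, `kept`, and `result` are illustrative only.

```r
# Example data from the question.
data <- read.table(text = "
col1 col2
group1 some
group1 some2
group1 some3
group2 some
group2 some2", header = TRUE, stringsAsFactors = FALSE)

# Step 1: one data.frame per distinct value of col1.
pieces <- split(data, data[["col1"]])
sapply(pieces, nrow)  # group sizes: group1 has 3 rows, group2 has 2

# Step 2: keep only the pieces with at least 3 rows.
kept <- Filter(function(piece) nrow(piece) >= 3, pieces)
names(kept)  # only "group1" remains

# Step 3: stack the surviving pieces back into one data.frame.
result <- do.call(rbind, kept)
nrow(result)
```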
GGamba