16

I have a data frame df, and I am trying to subset all rows that have a value in column B occur more than once in the dataset.

I tried using table to do it, but am having trouble subsetting from the table:

t<-table(df$B)

Then I try subsetting it using:

subset(df, table(df$B)>1)

And I get the error

"Error in x[subset & !is.na(subset)] : object of type 'closure' is not subsettable"

How can I subset my data frame using table counts?

Asmita Poddar
  • 524
  • 1
  • 11
  • 27
Chris Robles
  • 193
  • 1
  • 2
  • 9
  • I was looking at the plyr function count as well, but that creates an entirely new data frame. – Chris Robles Jul 01 '14 at 05:56
  • 1
    Well, all of these methods will create a new data.frame (or data.table). R generally doesn't modify objects in place. – MrFlick Jul 01 '14 at 06:24

3 Answers3

34

Here is a dplyr solution (using mrFlick's data.frame)

library(dplyr)
newd <-  dd %>% group_by(b) %>% filter(n()>1) #
newd
#    a b 
# 1  1 1 
# 2  2 1 
# 3  5 4 
# 4  6 4 
# 5  7 4 
# 6  9 6 
# 7 10 6 

Or, using data.table

setDT(dd)[,if(.N >1) .SD,by=b]

Or using base R

dd[dd$b %in% unique(dd$b[duplicated(dd$b)]),]
mnel
  • 113,303
  • 27
  • 265
  • 254
  • 2
    dd%>%group_by(b) %>%filter(n()>1) #should also work. Or dd%>%group_by(b) %>%filter(length(row_number())>1) – akrun Jul 01 '14 at 07:40
6

May I suggest an alternative, faster way to do this with data.table?

require(data.table) ## 1.9.2
setDT(df)[, .N, by=B][N > 1L]$B

(or) you can couple .I (another special variable - see ?data.table) which gives the corresponding row number in df, along with .N as follows:

setDT(df)[df[, .I[.N > 1L], by=B]$V1]

(or) have a look at @mnel's another for another variation (using yet another special variable .SD).

Arun
  • 116,683
  • 26
  • 284
  • 387
Mike.Gahan
  • 4,565
  • 23
  • 39
5

Using table() isn't the best because then you have to rejoin it to the original rows of the data.frame. The ave function makes it easier to calculate row-level values for different groups. For example

dd<-data.frame(
    a=1:10,
    b=c(1,1,2,3,4,4,4,5,6, 6)
)


dd[with(dd, ave(b,b,FUN=length))>1, ]
#subset(dd, ave(b,b,FUN=length)>1)    #same thing

    a b
1   1 1
2   2 1
5   5 4
6   6 4
7   7 4
9   9 6
10 10 6

Here, for each level of b, it counts the length of b, which is really just the number of b's and returns that back to the appropriate row for each value. Then we use that to subset.

MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • `with(dd, dd[b %in% names(table(b))[table(b) > 1],])` would allow you to use `table`, albeit not as cleanly. – thelatemail Jul 01 '14 at 06:15
  • Right. I hate having to call it twice like that. This is one of places I might use a [withX()](https://gist.github.com/MrFlick/35ff2265619762d49159) function: `dd[dd$b %in% withX(table(dd$b), names(X[X>1])), ]`. Or you could even use `names(Filter(function(x) x>1, table(dd$b)))` to get the names out. But that still converts them to character. But that's why I say it's "not the best" rather than not possible. – MrFlick Jul 01 '14 at 06:19