
I am trying to group a data.frame by a factor variable, and then return rows of the data.frame that correspond to observations that occur exactly once in each group. For example, consider the following data:

x = matrix(c(1,1,2,2,2,3,4,4,5,4), nrow = 5, ncol = 2, byrow = F)
x = data.frame(x)
x

#   X1 X2
# 1  1  3
# 2  1  4
# 3  2  4
# 4  2  5
# 5  2  4

I would like to group the data by the values in column 1, then return the rows for which the value in column 2 occurs only once within a group. Here, the function would return the first, second, and fourth rows.

Desired output

#   X1 X2
# 1  1  3
# 2  1  4
# 4  2  5

I am looking to apply this to a dataset with more than 1 million rows.


3 Answers


In base R, you can try ave:

x[with(x, ave(X2, X1, X2, FUN = length)) == 1, ]
#   X1 X2
# 1  1  3
# 2  1  4
# 4  2  5

Because ave scales very poorly when there are many groups spread across multiple grouping variables, you may want to collapse them into a single grouping variable first:

x[with(x, ave(X2, sprintf("%s__%s", X1, X2), FUN = length)) == 1, ]

The speeds will vary widely according to the nature of your data.
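The same composite key can also be built with interaction() instead of sprintf(); which one is faster will again depend on your data, so treat this as an equivalent sketch rather than a drop-in improvement:

x[with(x, ave(X2, interaction(X1, X2, drop = TRUE), FUN = length)) == 1, ]
#   X1 X2
# 1  1  3
# 2  1  4
# 4  2  5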

You can also try:

library(dplyr)
x %>%
  group_by(X1, X2) %>%
  filter(n() == 1)
# Source: local data frame [3 x 2]
# Groups: X1, X2 [3]
# 
#      X1    X2
#   (dbl) (dbl)
# 1     1     3
# 2     1     4
# 3     2     5
A5C1D2H2I1M1N2O1R2T1
  • Please refrain from voting on this answer. It is clearly a very naïve approach that won't scale nicely when the number of groups is large. This [is known](http://www.r-statistics.com/2011/08/comparison-of-ave-ddply-and-data-table/) (though that article hasn't been updated to show whether "dplyr" has improved). There is only one answer that deserves votes on this question. Please moderate yourselves and vote accordingly. Everyone will have a much more pleasant day as a result. Thanks! – A5C1D2H2I1M1N2O1R2T1 Mar 22 '16 at 03:40
  • That is a lengthy comment. But, I came to know that the `ave` solution is very fast. At first, I didn't try `ave` as I thought `dplyr` would be fast and included only that. So, you deserve the votes. Though, I think I have to deal with this kind of bias. – akrun Mar 22 '16 at 03:58
  • I request that you put it back from community wiki to your name, as my comment was based on a misunderstanding of the dplyr solution – akrun Mar 22 '16 at 04:05
  • @akrun, CW is an irreversible decision. – A5C1D2H2I1M1N2O1R2T1 Mar 22 '16 at 04:08
  • @akrun, At least in the development version of "data.table", you can just do: `as.data.table(x1)[, .N, by = .(X1, X2)][N == 1]`. Not sure if that works with the CRAN version or not. Also, please undelete your answer. It is informative. The modified version of `ave` that I shared might be faster than the "dplyr" I shared, but not as fast as the "dplyr" you shared, or the other alternatives, base or otherwise. – A5C1D2H2I1M1N2O1R2T1 Mar 22 '16 at 05:30
  • Thanks for sharing that. Yes, it works on the CRAN version, though the N should be assigned to NULL afterwards. I think I will leave mine deleted. Anyway, you are using the CW version, which makes my post a bit laughable. You could include the solutions in my post. – akrun Mar 22 '16 at 05:33

We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(x)) and group by the first column, i.e. "X1": if a group has only one observation, we return that row; otherwise we remove all duplicated "X2" values and return only the rows that occur once.

library(data.table)
setDT(x)[, if(.N==1) .SD else 
   .SD[!(duplicated(X2)|duplicated(X2, fromLast=TRUE))], X1]
#   X1 X2
#1:  1  3
#2:  1  4
#3:  2  5

If we are using both "X1" and "X2" as grouping variables:

setDT(x)[x[, .I[.N==1], .(X1, X2)]$V1]
#   X1 X2
#1:  1  3
#2:  1  4
#3:  2  5

NOTE: data.table is very fast and the syntax is compact.


Or, without using any group-by operation, with base R we can do

x[!(duplicated(x)|duplicated(x, fromLast=TRUE)),]
#  X1 X2
#1  1  3
#2  1  4
#4  2  5

Or with tally from dplyr

library(dplyr)
x %>%
  group_by_(.dots = names(x)) %>%
  tally() %>%
  filter(n==1) %>%
  select(-n)

Note that this should be faster than the other dplyr solution.
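Note also that tally() returns only the grouping columns. With just "X1" and "X2" that is the full answer, but if your real data has additional columns, one way to recover the complete original rows is a semi_join back onto the data (a sketch; count() here is shorthand for group_by() + tally()):

library(dplyr)
singles <- x %>% count(X1, X2) %>% filter(n == 1)  # (X1, X2) pairs occurring once
x %>% semi_join(singles, by = c("X1", "X2"))       # keep the matching original rows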

Benchmarks

library(data.table)
library(dplyr)

Sample data

set.seed(24)
x1 <- data.frame(X1 = sample(1:5000, 1e6, replace=TRUE), 
                 X2 = sample(1:10000, 1e6, replace=TRUE))
x2 <- copy(as.data.table(x1))

Base R approaches

system.time(x1[with(x1, ave(X2, sprintf("%s__%s", X1, X2), FUN = length)) == 1, ])
#    user  system elapsed 
#  20.245   0.002  20.280 

system.time(x1[!(duplicated(x1)|duplicated(x1, fromLast=TRUE)), ])
#    user  system elapsed 
#   1.994   0.000   1.998 

dplyr approaches

system.time(x1 %>% group_by(X1, X2) %>% filter(n() == 1))
#    user  system elapsed 
#  33.400   0.006  33.467 

system.time(x1 %>% group_by_(.dots = names(x1)) %>% tally() %>% filter(n==1) %>% select(-n))
#    user  system elapsed 
#   2.331   0.000   2.333 

data.table approaches

system.time(x2[x2[, .I[.N==1], list(X1, X2)]$V1])
#    user  system elapsed 
#   1.128   0.001   1.131 

system.time(x2[, .N, by = list(X1, X2)][N == 1][, N := NULL][])
#    user  system elapsed 
#   0.320   0.000   0.323

Summary: The "data.table" approaches win hands down, but if you're unable to use the package for some reason, using duplicated from base R also performs quite well.
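As a quick sanity check that the timed approaches agree, you can compare the selected (X1, X2) combinations as sets; both results contain each combination exactly once, so comparing pasted keys is enough (a sketch):

res_base <- x1[!(duplicated(x1) | duplicated(x1, fromLast = TRUE)), ]
res_dt <- x2[, .N, by = list(X1, X2)][N == 1][, N := NULL][]
# TRUE if both methods keep exactly the same combinations
setequal(paste(res_base$X1, res_base$X2), paste(res_dt$X1, res_dt$X2))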

A5C1D2H2I1M1N2O1R2T1
akrun
  • I can't do anything about the 3 votes I have right now, but I've made my answer CW so I don't get any reputation from this. I'll add a bounty for 50 points to the question when I can (after 2 days, I believe) to nullify the votes I've received so far. And, what's wrong with wishing someone a beautiful day? – A5C1D2H2I1M1N2O1R2T1 Mar 22 '16 at 03:58
  • @AHandcartAndMohair I was wrong. Sorry. I didn't test your ave solution earlier, as I thought dplyr was superior and based the comparisons on that. Maybe with more groups and such, dplyr would overtake it. But, in the benchmarked example, it is your ave solution that is performing better. We can't change people's voting pattern. What I could do was take back the two votes I had given today to a person who I thought was fair in voting. – akrun Mar 22 '16 at 04:01
  • I think there's a problem in your `system.time` for `ave`, because it's nowhere near as fast for me, and "dplyr" should have improved this behavior in comparison to the benchmarks that Paul had run. – A5C1D2H2I1M1N2O1R2T1 Mar 22 '16 at 04:07
  • @AHandcartAndMohair I don't know the issue. I ran it a couple of times and I was getting the same time – akrun Mar 22 '16 at 04:08
  • You're grouping by only one variable, so it would be fast. When the interaction of multiple variables comes into play, it will slow down dramatically. It might even stall your system. – A5C1D2H2I1M1N2O1R2T1 Mar 22 '16 at 04:09
  • @AHandcartAndMohair Alright, I think the whole solution is a mess. I will delete this. – akrun Mar 22 '16 at 04:10
  • @AHandcartAndMohair I don't know what to say except thank you. – akrun Apr 05 '16 at 03:21

With base R, something like

do.call(rbind, lapply(split(x, x$X1),
                      function(y) y[y$X2 %in% names(which(table(y$X2) == 1)), ]))
#     X1 X2
# 1.1  1  3
# 1.2  1  4
# 2    2  5

where split splits x into a list of data.frames by the levels of X1, lapply keeps the rows of each piece whose X2 value occurs exactly once (counted with table), and do.call(rbind, ...) reassembles the resulting data.frames into a single one.
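The same counting can be done without splitting at all: tabulate both columns at once and index the resulting count matrix by character pairs, one lookup per row (a sketch of an equivalent base R idiom):

cnt <- table(x$X1, x$X2)   # counts per (X1, X2) pair
x[cnt[cbind(as.character(x$X1), as.character(x$X2))] == 1, ]
#   X1 X2
# 1  1  3
# 2  1  4
# 4  2  5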

alistaire