We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(x), then group by the first column, i.e. "X1": if a group has only one observation, return that row; otherwise drop every row whose "X2" value is duplicated within the group and return only the unique rows.
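(The examples below operate on the x from the question. A minimal data frame consistent with the outputs shown — this exact construction is an inference, not the OP's actual data — would be:)

```r
# hypothetical reconstruction of the question's data,
# inferred from the outputs printed below
x <- data.frame(X1 = c(1, 1, 1, 2, 1),
                X2 = c(3, 4, 2, 5, 2))
```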
library(data.table)
setDT(x)[, if(.N==1) .SD else
.SD[!(duplicated(X2)|duplicated(X2, fromLast=TRUE))], X1]
# X1 X2
#1: 1 3
#2: 1 4
#3: 2 5
If we are using both "X1" and "X2" as grouping variables, we keep only the groups with a single row:
setDT(x)[x[, .I[.N==1], .(X1, X2)]$V1]
# X1 X2
#1: 1 3
#2: 1 4
#3: 2 5
NOTE: data.table is very fast and the code is compact.
Or, without using any group-by option, we can do it in base R by removing rows that are duplicated in either direction:
x[!(duplicated(x)|duplicated(x, fromLast=TRUE)),]
# X1 X2
#1 1 3
#2 1 4
#4 2 5
Or with tally() from dplyr (note that group_by_() is deprecated in recent dplyr versions; group_by(across(everything())) is the modern equivalent):
library(dplyr)
x %>%
  group_by_(.dots = names(x)) %>%
  tally() %>%
  filter(n == 1) %>%
  select(-n)
Note that this should be faster than the other dplyr solution.
Benchmarks
library(data.table)
library(dplyr)
Sample data
set.seed(24)
x1 <- data.frame(X1 = sample(1:5000, 1e6, replace = TRUE),
                 X2 = sample(1:10000, 1e6, replace = TRUE))
x2 <- copy(as.data.table(x1))
Base R approaches
system.time(x1[with(x1, ave(X2, sprintf("%s__%s", X1, X2), FUN = length)) == 1, ])
# user system elapsed
# 20.245 0.002 20.280
system.time(x1[!(duplicated(x1)|duplicated(x1, fromLast=TRUE)), ])
# user system elapsed
# 1.994 0.000 1.998
dplyr approaches
system.time(x1 %>% group_by(X1, X2) %>% filter(n() == 1))
# user system elapsed
# 33.400 0.006 33.467
system.time(x1 %>% group_by_(.dots = names(x1)) %>% tally() %>% filter(n == 1) %>% select(-n))
# user system elapsed
# 2.331 0.000 2.333
data.table approaches
system.time(x2[x2[, .I[.N==1], list(X1, X2)]$V1])
# user system elapsed
# 1.128 0.001 1.131
system.time(x2[, .N, by = list(X1, X2)][N == 1][, N := NULL][])
# user system elapsed
# 0.320 0.000 0.323
Summary: The data.table approaches win hands down, but if you're unable to use the package for some reason, duplicated() from base R also performs quite well.