Specific removing all duplicates with R

Question

For example I have two columns:

And I need to display only the rows whose value in column Var1 occurs exactly once:

I can do it like this:

 mydata=mydata[!unique(mydata$Var1),]

But when I use the same formula on my large data set with about 1 million observations, nothing happens: the sample size is still the same. Could you please explain why?

Thank you!

you should look at what `unique(mydata$Var1)` gives you and then what `!unique(mydata$Var1)` gives and then what `mydata[!unique(mydata$Var1), ]` gives — rawr, Apr 29 '15 at 21:51
The irony that a post about removing duplicates was deemed to be a duplicate. — Tyler Rinker, Apr 29 '15 at 22:59

David Arenburg · Answer 1 · 2015-04-29T22:21:00.630

With data.table (as the question seems to be tagged with it) I would do

indx <- setDT(DT)[, .I[.N == 1], by = Var1]$V1 
DT[indx]
#    Var1 Var2
# 1:    4    8
# 2:    5   67
# 3:    6   12

Or... as @eddi reminded me, you can simply do

DT[, if(.N == 1) .SD, by = Var1]

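Or (per the mentioned duplicates) with v >= 1.9.5 you could also do something like the following. Passing by = "Var1" explicitly matters on v >= 1.9.8, where duplicated() on a data.table compares all columns by default instead of only the key columns.

setDT(DT, key = "Var1")[!(duplicated(DT, by = "Var1") | duplicated(DT, by = "Var1", fromLast = TRUE))]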

bgoldst · Answer 2 · 2015-04-29T21:55:15.853

You can use this:

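df <- data.frame(Var1 = c(1, 1, 2, 2, 3, 3, 4, 5, 6), Var2 = c(12, 65, 68, 98, 49, 24, 8, 67, 12))
# ave() returns, for each row, the size of that row's Var1 group; keep groups of size 1
df[ave(seq_len(nrow(df)), df$Var1, FUN = length) == 1, ]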
##   Var1 Var2
## 7    4    8
## 8    5   67
## 9    6   12

This will work even if the Var1 column is not ordered, because ave() does the necessary work to collect groups of equal elements (even if they are non-consecutive in the grouping vector) and map the result of the function call (length() in this case) back to each element that was a member of the group.
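To make the grouping behaviour concrete, here is a small sketch on deliberately unordered data. The data frame `df2` and the names `group_size` and `singles` are made up for illustration and are not from the question.

```r
# Hypothetical data: equal Var1 values are not adjacent
df2 <- data.frame(Var1 = c(3, 1, 2, 1, 3, 5),
                  Var2 = c(10, 20, 30, 40, 50, 60))

# For every row, ave() reports how many rows share that row's Var1 value
group_size <- ave(seq_len(nrow(df2)), df2$Var1, FUN = length)
print(group_size)
# [1] 2 2 1 2 2 1

# Rows whose Var1 value occurs exactly once
singles <- df2[group_size == 1, ]
print(singles)
```

Only the rows with Var1 equal to 2 and 5 survive, since 1 and 3 each appear twice even though their occurrences are scattered.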

Regarding your code, it doesn't work because this is what unique() and its negation return:

unique(df$Var1)
## [1] 1 2 3 4 5 6
!unique(df$Var1)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

As you can see, unique() returns the actual unique values from the argument vector. Negation returns true for zero and false for everything else.

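Thus, you end up row-indexing with a logical vector that is TRUE where unique() returned zero and FALSE otherwise. It is shorter than the number of rows whenever unique() removed any duplicates, in which case R recycles it across the rows. The index therefore says nothing about which rows have a repeated Var1 value.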
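As a rough base R illustration of the point above, the sketch below applies the question's indexing to the sample data from this answer and then builds a logical index that has one entry per row. The names `bad`, `is_dup` and `keep` are made up for this example. It assumes a numeric Var1; other column types may behave differently under `!`.

```r
df <- data.frame(Var1 = c(1, 1, 2, 2, 3, 3, 4, 5, 6),
                 Var2 = c(12, 65, 68, 98, 49, 24, 8, 67, 12))

# The question's approach: the index is all FALSE here, so no row is selected
bad <- df[!unique(df$Var1), ]
print(nrow(bad))
# [1] 0

# One flag per row: TRUE if the Var1 value also appears earlier or later
is_dup <- duplicated(df$Var1) | duplicated(df$Var1, fromLast = TRUE)
keep <- df[!is_dup, ]
print(keep)
```

Scanning from both ends is what flags every member of a duplicated group, not just the second and later occurrences.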
