How to identify only "not duplicated" rows

Question

I have a situation like this. Multiple data.table "rbinded".

library(data.table)
x <-  data.table(id=c(1,2,3,4),dsp=c(5,6,7,8),status=c(FALSE,TRUE,FALSE,TRUE))
y <-  data.table(id=c(1,2,3,4),dsp=c(6,6,7,8),status=c(FALSE,FALSE,FALSE,TRUE))
z <- data.table(id=c(1,2,3,4),dsp=c(5,6,9,8),status=c(FALSE,TRUE,FALSE,FALSE))
w <- data.table(id=c(1,2,3,4),dsp=c(5,6,7,NA),status=c(FALSE,TRUE,FALSE,TRUE))
setkey(x,id)
setkey(y,id)
setkey(z,id)
setkey(w,id)
Bigdt<-rbind(x,y,z,w)

I would like to obtain ONLY the not repeated rows like:

id  dsp status
1   6   FALSE
2   6   FALSE
3   9   FALSE
4   8   FALSE
4   NA  TRUE

So i tried

Resultdt<-Bigdt[!duplicated(Bigdt)]

but the result:

id  dsp status
1   5   FALSE
2   6   TRUE
3   7   FALSE
4   8   TRUE

does not match my espectations. I tried in different methods (as rbind is not mandatory), for example merge, join etc. the data.table package seems potentially the one that contains the solution...apparently. Any ideas?

The code works fine for me (by copy pasting from here), Frank, sorry but your code does not give the espected result... — Antonello Salis, May 27 '16 at 15:03
correct! Sorry, i will edit, but what about the other lines? — Antonello Salis, May 27 '16 at 15:04
This is what I obtain: id dsp status N 1 6 FALSE 1 2 6 FALSE 1 3 9 FALSE 1 4 8 FALSE 1 4 NA TRUE 1 — Antonello Salis, May 27 '16 at 15:09
Here's the question for R in general, with an answer like one of @akrun's below http://stackoverflow.com/q/13763216/ — Frank, May 27 '16 at 21:00

score 9 · Answer 1 · edited May 27 '16 at 20:50

9

You can do

Bigdt[, .N, by=names(Bigdt)][N == 1L][, N := NULL][]

   id dsp status
1:  1   6  FALSE
2:  2   6  FALSE
3:  3   9  FALSE
4:  4   8  FALSE
5:  4  NA   TRUE

To see how it works, run just part of the DT[][][][] chain:

Bigdt[, .N, by=names(Bigdt)]
Bigdt[, .N, by=names(Bigdt)][N == 1L]
Bigdt[, .N, by=names(Bigdt)][N == 1L][, N := NULL]

edited May 27 '16 at 20:50

jangorecki

16,384
4
79
160

answered May 27 '16 at 15:10

Frank

66,179
8
96
180

1

Thank you Frank! :) – Antonello Salis May 27 '16 at 15:11
1

If you hit memory limits with Frank's solution you can optionally use `dt[, N := rowidv(dt)]` to add sequence of duplicated rows to dataset in-place, then subset on `N==1L`. – jangorecki May 27 '16 at 19:28
@jangorecki The data table has to be sorted for that? I'll add it to the answer. Feel free to edit in any other details or alternatives. – Frank May 27 '16 at 20:00
@Frank no, see `rowidv(data.table(c("c","a","c","b","c","a")))`, it is extensively used in set operators using `all=TRUE` feature for handling duplicates. Just realized it won't answer the question as it lists all the unique rows instead - removing it from the answer then. – jangorecki May 27 '16 at 20:46

akrun · Answer 2 · 2016-05-27T20:53:00.497

You may also try

Bigdt[!(duplicated(Bigdt)|duplicated(Bigdt, fromLast=TRUE))]
#   id dsp status
#1:  1   6  FALSE
#2:  2   6  FALSE
#3:  3   9  FALSE
#4:  4   8  FALSE
#5:  4  NA   TRUE

Or if we are using .SD

Bigdt[Bigdt[,!(duplicated(.SD)|duplicated(.SD, fromLast=TRUE))]]

Or another option would be grouping by the column names, find the row index with .I and subset the dataset

Bigdt[Bigdt[, .I[.N==1], by = names(Bigdt)]$V1]

How to identify only "not duplicated" rows

2 Answers2