7

I have a situation like this. Multiple data.table "rbinded".

library(data.table)
x <-  data.table(id=c(1,2,3,4),dsp=c(5,6,7,8),status=c(FALSE,TRUE,FALSE,TRUE))
y <-  data.table(id=c(1,2,3,4),dsp=c(6,6,7,8),status=c(FALSE,FALSE,FALSE,TRUE))
z <- data.table(id=c(1,2,3,4),dsp=c(5,6,9,8),status=c(FALSE,TRUE,FALSE,FALSE))
w <- data.table(id=c(1,2,3,4),dsp=c(5,6,7,NA),status=c(FALSE,TRUE,FALSE,TRUE))
setkey(x,id)
setkey(y,id)
setkey(z,id)
setkey(w,id)
Bigdt<-rbind(x,y,z,w)

I would like to obtain ONLY the not repeated rows like:

id  dsp status
1   6   FALSE
2   6   FALSE
3   9   FALSE
4   8   FALSE
4   NA  TRUE

So i tried

Resultdt<-Bigdt[!duplicated(Bigdt)]

but the result:

id  dsp status
1   5   FALSE
2   6   TRUE
3   7   FALSE
4   8   TRUE

does not match my espectations. I tried in different methods (as rbind is not mandatory), for example merge, join etc. the data.table package seems potentially the one that contains the solution...apparently. Any ideas?

Antonello Salis
  • 167
  • 1
  • 8

2 Answers2

9

You can do

Bigdt[, .N, by=names(Bigdt)][N == 1L][, N := NULL][]

   id dsp status
1:  1   6  FALSE
2:  2   6  FALSE
3:  3   9  FALSE
4:  4   8  FALSE
5:  4  NA   TRUE

To see how it works, run just part of the DT[][][][] chain:

  • Bigdt[, .N, by=names(Bigdt)]
  • Bigdt[, .N, by=names(Bigdt)][N == 1L]
  • Bigdt[, .N, by=names(Bigdt)][N == 1L][, N := NULL]
jangorecki
  • 16,384
  • 4
  • 79
  • 160
Frank
  • 66,179
  • 8
  • 96
  • 180
  • 1
    Thank you Frank! :) – Antonello Salis May 27 '16 at 15:11
  • 1
    If you hit memory limits with Frank's solution you can optionally use `dt[, N := rowidv(dt)]` to add sequence of duplicated rows to dataset in-place, then subset on `N==1L`. – jangorecki May 27 '16 at 19:28
  • @jangorecki The data table has to be sorted for that? I'll add it to the answer. Feel free to edit in any other details or alternatives. – Frank May 27 '16 at 20:00
  • @Frank no, see `rowidv(data.table(c("c","a","c","b","c","a")))`, it is extensively used in set operators using `all=TRUE` feature for handling duplicates. Just realized it won't answer the question as it lists all the unique rows instead - removing it from the answer then. – jangorecki May 27 '16 at 20:46
3

You may also try

Bigdt[!(duplicated(Bigdt)|duplicated(Bigdt, fromLast=TRUE))]
#   id dsp status
#1:  1   6  FALSE
#2:  2   6  FALSE
#3:  3   9  FALSE
#4:  4   8  FALSE
#5:  4  NA   TRUE

Or if we are using .SD

Bigdt[Bigdt[,!(duplicated(.SD)|duplicated(.SD, fromLast=TRUE))]]

Or another option would be grouping by the column names, find the row index with .I and subset the dataset

Bigdt[Bigdt[, .I[.N==1], by = names(Bigdt)]$V1]
akrun
  • 874,273
  • 37
  • 540
  • 662