Filtering out duplicated/non-unique rows in data.table

Question

Edit 2019: This question was asked prior to changes in data.table in November 2016, see the accepted answer below for both the current and previous methods.

I have a data.table table with about 2.5 million rows. There are two columns. I want to remove any rows that are duplicated in both columns. Previously for a data.frame I would have done this: df -> unique(df[,c('V1', 'V2')]) but this doesn't work with data.table. I have tried unique(df[,c(V1,V2), with=FALSE]) but it seems to still only operate on the key of the data.table and not the whole row.

Any suggestions?

Cheers, Davy

Example

>dt
      V1   V2
[1,]  A    B
[2,]  A    C
[3,]  A    D
[4,]  A    B
[5,]  B    A
[6,]  C    D
[7,]  C    D
[8,]  E    F
[9,]  G    G
[10,] A    B

in the above data.table where V2 is the table key, only rows 4,7, and 10 would be removed.

dt <- data.table::data.table(
  V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"),
  V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G"),
)

Without using unique `dt[, .N,by=.(V1,V2)][,1:2]` – Akki Dec 22 '17 at 11:14 — Akki, Dec 22 '17 at 11:14

score 125 · Accepted Answer · edited Nov 07 '19 at 08:23

125

For v1.9.8+ (released November 2016)

From ?unique.data.table By default all columns are being used (which is consistent with ?unique.data.frame)

unique(dt)
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  C  D
6:  E  F
7:  G  G

Or using the by argument in order to get unique combinations of specific columns (like previously keys were used for)

unique(dt, by = "V2")
   V1 V2
1:  A  B
2:  A  C
3:  A  D
4:  B  A
5:  E  F
6:  G  G

Prior v1.9.8

From ?unique.data.table, it is clear that calling unique on a data table only works on the key. This means you have to reset the key to all columns before calling unique.

library(data.table)
dt <- data.table(
  V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

Calling unique with one column as key:

setkey(dt, "V2")
unique(dt)
     V1 V2
[1,]  B  A
[2,]  A  B
[3,]  A  C
[4,]  A  D
[5,]  E  F
[6,]  G  G

edited Nov 07 '19 at 08:23

Brian Stamper

2,143
1
18
41

answered Aug 03 '12 at 09:04

Andrie

176,377
47
447
496

2

This only works if no key has been set. I'll edit the question above to make that clear. Sorry – Davy Kavanagh Aug 03 '12 at 09:12
2

as akrun answered here: http://stackoverflow.com/questions/40949023/r-somehow-unique-is-not-working-for-my-data-table the first version now needs a by= option to work – Jakob Dec 03 '16 at 15:00
@PeterPan the link you posted is dead – wolfsatthedoor Jan 22 '17 at 01:10
@robertevansanders weird. So what works now? For me the by= option still works – Jakob Jan 23 '17 at 09:01
17

@Andrie this solution no longer works, as @PeterPan pointed out. `data.table` no longer considers `unique()` in keys. The option `unique(, by = c(keys))` has to be used now. – altabq Mar 01 '17 at 14:10
4

Let it be known that altabq is right, and the things in keys must be surrounded with quotation marks. So you would want unique(dt, by=c("V1","V2")) as your answer. – Corey Levinson Sep 17 '17 at 08:29
As others have pointed out, I also just realized this does not work anymore in the current version and might lead to confusion. By the way: Does anyone know why this has changed? The previous way of handling it via keys was so much more convenient imho. The above solution should be changed because currently it leads to confusion. – EDC Oct 22 '17 at 23:13
How can I remove only consecutive duplicated rows? – skan Oct 16 '19 at 19:42

score 9 · Answer 2 · answered Jan 16 '13 at 02:50

With your example data.table...

> dt<-data.table(V1 = c("B", "A", "A", "A", "A", "A", "C", "C", "E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F", "G"))
> setkey(dt,V2)

Consider the following tests:

> haskey(dt) # obviously dt has a key, since we just set it
[1] TRUE

> haskey(dt[,list(V1,V2)]) # ... but this is treated like a "new" table, and does not have a key
[1] FALSE

> haskey(dt[,.SD]) # note that this still has a key
[1] TRUE

So, you can list the columns of the table and then take the unique() of that, with no need for setting the key to all columns or dropping it (by setting it to NULL) as required by the solution from @Andrie (and edited by @MatthewDowle). The solutions suggested by @Pop and @Rahul didn't work for me.

See Try 3 below, which is very similar to your initial try. Your example was not clear so I'm not sure why it didn't work. Also it was a few months ago when you posted the question, so maybe data.table was updated?

> unique(dt) # Try 1: wrong answer (missing V1=C and V2=D)
   V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  E  F
6:  G  G

> dt[!duplicated(dt)] # Try 2: wrong answer (missing V1=C and V2=D)
   V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  E  F
6:  G  G

> unique(dt[,list(V1,V2)]) # Try 3: correct answer; does not require modifying key
   V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  C  D
6:  E  F
7:  G  G

> setkey(dt,NULL)
> unique(dt) # Try 4: correct answer; requires key to be removed
   V1 V2
1:  B  A
2:  A  B
3:  A  C
4:  A  D
5:  C  D
6:  E  F
7:  G  G

Maybe a new `unique(...,use.key=FALSE)` argument would help; now filed as [FR#2483](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2483&group_id=240&atid=978). — Matt Dowle, Jan 18 '13 at 00:23
Hi @MatthewDowle. Yes, that would be a nice convenience. I think your comment in the FR is also correct--if the key is unique then `use.key=FALSE` could be ignored. — dnlbrky, Jan 18 '13 at 14:12
data.table 1.9.6 (and not doubt earlier versions) has option `by=` which can be used to override the key. Setting `by=NULL` "uses all columns and acts like the analogous data.frame methods." — JWilliman, Dec 01 '15 at 23:20

score 1 · Answer 3 · answered Aug 03 '12 at 09:03

1

unique(df) works on your example.

answered Aug 03 '12 at 09:03

Pop

12,135
5
55
68

score 1 · Answer 4 · edited May 20 '20 at 11:14

1

This should work for you

dt <- unique(dt, by = c('V1', 'V2'))

edited May 20 '20 at 11:14

mnist

6,571
1
18
41

answered Apr 08 '19 at 10:21

Magma

405
1
4
7

1

The OP wants to remove duplicated rows,across the whole row of the data table as opposed to just the key. This should do that. – Magma Apr 08 '19 at 11:47

score 0 · Answer 5 · answered May 20 '20 at 09:54

0

Keeping data.table notation you can use:

unique(df[, .(V1, V2, V3), nomatch=0 ])

As here https://stackoverflow.com/a/31875208/10087503

I haven't compared speed of this vs Magma's version.

answered May 20 '20 at 09:54

aclong

181
2
7

Filtering out duplicated/non-unique rows in data.table

5 Answers5

Linked

Related