I'm migrating from data frames and matrices to data tables, but haven't found a solution for extracting the unique rows from a data table. I presume there's something I'm missing about the [,J] notation, though I've not yet found an answer in the FAQ and intro vignettes. How can I extract the unique rows, without converting back to data frames?

Here is an example:

library(data.table)
set.seed(123)
a <- matrix(sample(2, 120, replace = TRUE), ncol = 3)
a <- as.data.frame(a)
b <- as.data.table(a)

# Confirm dimensionality
dim(a) # 40  3
dim(b) # 40  3

# Unique rows using all columns
dim(unique(a))  # 8 3
dim(unique(b))  # 34 3

# Unique rows using only a subset of columns
dim(unique(a[,c("V1","V2")]))   # 4 2
dim(unique(b[,list(V1,V2)]))    # 29 2

Related question: Is this behavior a result of the data being unsorted, as with the Unix uniq function?

Iterator

2 Answers

Before data.table v1.9.8, the default behavior of the unique.data.table method was to use the key to determine the columns by which unique combinations were returned. If the key was NULL (the default), one would get the original data set back (as in the OP's situation).

As of data.table v1.9.8, the unique.data.table method uses all columns by default, which is consistent with unique.data.frame in base R. To have it use the key columns instead, explicitly pass by = key(DT) to unique (replacing DT in the call to key with the name of your data.table).

Hence, old behavior would be something like

library(data.table) # v1.9.7 or earlier
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b)) 
## [1] 8 3

While for data.table v1.9.8+, just

b <- data.table(a) 
dim(unique(b)) 
## [1] 8 3
## or dim(unique(b, by = key(b))) # if b has a key and you want to use it

Or without a copy

setDT(a)
dim(unique(a))
## [1] 8 3
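
For the column-subset case from the question, here is a minimal sketch, assuming data.table v1.9.8 or later. The small table `dt` is made up for illustration and is not the data from the question:

```r
library(data.table)

# Hypothetical toy data (row 5 duplicates row 1)
dt <- data.table(V1 = c(1, 1, 2, 2, 1),
                 V2 = c(1, 1, 1, 2, 1),
                 V3 = c(1, 2, 1, 1, 1))

# Unique rows over all columns (the default in v1.9.8+)
u_all <- unique(dt)

# One row per distinct (V1, V2) pair, keeping every column;
# the first row of each pair is retained
u_by <- unique(dt, by = c("V1", "V2"))

# Only the V1 and V2 columns, deduplicated
u_sub <- unique(dt[, .(V1, V2)])

dim(u_all)  # 4 3
dim(u_by)   # 3 3
dim(u_sub)  # 3 2
```

The `by` form is useful when the remaining columns should be kept; selecting the columns first is closer to `unique(a[, c("V1", "V2")])` on a data frame.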
David Arenburg
  • This is rather interesting. Practically speaking, this behavior is like the Unix `uniq` function: it depends on the data being sorted. I haven't checked whether the base R function, `unique`, depends on sorting, though it appears to present the output in the original order. Btw, where did you find this in the documentation? I must've missed that part. – Iterator Sep 26 '11 at 23:28
  • Look at the entry for `duplicated()` in the data.table [pdf](http://cran.r-project.org/web/packages/data.table/data.table.pdf), or try ?unique.data.table. –  Sep 26 '11 at 23:53
  • Excellent pointers! I see that `unique` is buried in the documentation. Hopefully that will be fixed. Good find on `?unique.data.table`. I also overlooked trying `methods(class = "data.table")`. – Iterator Sep 27 '11 at 00:00
  • I've raised [bug #1601](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1601&group_id=240&atid=975) to address the original point. Thanks. – Matt Dowle Sep 27 '11 at 09:11
  • That's now fixed in v1.6.7 so unique on an unsorted data.table now works without needing to set a key. Also improved the documentation. – Matt Dowle Oct 09 '11 at 00:05
  • Fyi, this has changed -- `unique` used to use the key (if any) by default, but now it uses all columns. – Frank Mar 04 '17 at 00:17
  • `unique.data.frame` also works and supports more column types (e.g. lists) – Ufos Mar 15 '18 at 15:04

As Seth mentioned, the data.table package has evolved and now offers optimized functions for this.

For those who don't want to dig through the documentation, here is a fast, memory-efficient way to do what you want:

uniqueN(a)

If you only want to use a subset of columns, pass them in the by argument:

uniqueN(a,by = c('V1','V2'))

EDIT: As mentioned in the comments, this only gives the count of unique rows. To get the unique rows themselves, use unique instead:

unique(a)

And for a subset :

unique(a[, .(V1, V2)])  # or unique(a, by = c('V1','V2')) to keep all columns
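
To make the difference between counting and extracting concrete, here is a small sketch, assuming a data.table version that provides uniqueN with a by argument. The table `dt` is hypothetical and stands in for `a`:

```r
library(data.table)

# Hypothetical toy data for illustration
dt <- data.table(V1 = c(1, 1, 2), V2 = c(1, 1, 2), V3 = c(1, 2, 3))

# uniqueN returns a single number: how many distinct rows there are
n_all <- uniqueN(dt)                      # 3
n_by  <- uniqueN(dt, by = c("V1", "V2"))  # 2

# unique returns the rows themselves; its row count matches uniqueN
rows_all <- unique(dt)
rows_by  <- unique(dt, by = c("V1", "V2"))

n_all == nrow(rows_all)  # TRUE
n_by  == nrow(rows_by)   # TRUE
```

So uniqueN is the one to reach for when only the count is needed, since it avoids materializing the deduplicated table.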

Sacha
  • mm when I do this I don't get a data.table, I just get a vector with the number of observations afterwards? that is, a summary, of sorts. – emilBeBri Mar 27 '18 at 09:22
  • This question is not about counting uniques but about extracting unique rows, so I don't see how your answer answers the question. – David Arenburg Jun 17 '18 at 11:04
  • @DavidArenburg you are right. I have just edited the answer if you want to get the rows instead of the count of rows. – Sacha Jun 17 '18 at 12:43
  • That doesn't add anything to the existing answer though. That was my point in the first place. If you have some minor edits to the existing answer, you should edit it instead of posting a new one. – David Arenburg Jun 17 '18 at 14:06