How to get unique() to work on data.tables with character columns?

Question

If I create an R data.table with string columns and stringsAsFactors=FALSE, and then take its unique rows with unique(), the strings are stripped from the resulting table, even though they are still considered when determining which rows are unique.

> dt <- data.table(x=c('a', 'a', 'b', 'c'), y=c(1, 1, 2, 2), stringsAsFactors=FALSE)
> unique(dt)
   x y
1:   1
2:   2
3:   2
> dt <- data.table(x=c('a', 'a', 'b', 'c'), y=c(1, 1, 2, 2), stringsAsFactors=TRUE)
> unique(dt)
   x y
1: a 1
2: b 2
3: c 2

Is this correct behavior? I'm on Cygwin and have uncovered a few mysterious Cygwin-specific issues in the R internals before. Here's the readout of sessionInfo():

R version 3.4.0 (2017-04-21)
Platform: x86_64-unknown-cygwin (64-bit)
Running under: CYGWIN_NT-6.1 INT-3A02 2.8.1(0.312/5/3) 2017-07-03 14:11 x86_64 Cygwin

Matrix products: default
LAPACK: /usr/lib/R/modules/lapack.dll

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] bit_1.1-12     compiler_3.4.0 bit64_0.9-7

On my machine (Ubuntu 16.04 with `R version 3.4.1 (2017-06-30) -- "Single Candle"`) it also works with the first option. — Garini, Jul 12 '17 at 17:47
I'm not surprised; I've had Cygwin-only problems with R strings before (https://stackoverflow.com/questions/44187906/merging-large-data-tables-on-character-columns-causes-segfault). — Connor Harris, Jul 12 '17 at 17:49

score 1 · Accepted Answer · answered Jul 12 '17 at 20:24

The duplicated() function may provide a workaround: dt[!duplicated(dt), ] returns the same result as unique(dt) in both cases on my system (Ubuntu Linux, kernel 3.13.0-121-generic).

> library(data.table)
> dt <- data.table(x=factor(c('a', 'a', 'b', 'c')), y=c(1, 1, 2, 2))
> all.equal(unique(dt), dt[!duplicated(dt), ])
[1] TRUE

> dt <- data.table(x=c('a', 'a', 'b', 'c'), y=c(1, 1, 2, 2))
> all.equal(unique(dt), dt[!duplicated(dt), ])
[1] TRUE
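As a minimal sketch of the same workaround on the question's toy table, the snippet below also checks that the character column survives. The variable names `dup` and `res` and the `stopifnot()` checks are illustrative additions, not part of the original answer. On a platform affected by the problem, the check would be expected to fail if the strings were stripped.

```r
library(data.table)

# Same toy table as in the question, with x kept as character.
dt <- data.table(x = c("a", "a", "b", "c"), y = c(1, 1, 2, 2))

# duplicated() flags every row that repeats an earlier row;
# negating it keeps the first occurrence of each distinct row.
dup <- duplicated(dt)
res <- dt[!dup]

# Sanity check (illustrative): the character column should come back intact.
stopifnot(is.character(res$x), identical(res$x, c("a", "b", "c")))

print(res)
```

Keeping the logical vector in `dup` makes it easy to inspect which rows were treated as duplicates if the result looks wrong.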

Related post: Finding ALL duplicate rows, including "elements with smaller subscripts"

I don't know why this should work, but it does. Thanks! — Connor Harris, Jul 14 '17 at 20:20
