If I create an R data.table with string columns without setting stringsAsFactors=TRUE and then take the unique rows of the table with unique(), the strings are stripped from the resulting table, even though they are still taken into account when determining which rows are unique.

> dt <- data.table(x=c('a', 'a', 'b', 'c'), y=c(1, 1, 2, 2), stringsAsFactors=FALSE)
> unique(dt)
   x y
1:   1
2:   2
3:   2
> dt <- data.table(x=c('a', 'a', 'b', 'c'), y=c(1, 1, 2, 2), stringsAsFactors=TRUE)
> unique(dt)
   x y
1: a 1
2: b 2
3: c 2

Is this correct behavior? I'm on Cygwin and have uncovered a few mysterious Cygwin-specific issues in the R internals before. Here's the readout of sessionInfo():

R version 3.4.0 (2017-04-21)
Platform: x86_64-unknown-cygwin (64-bit)
Running under: CYGWIN_NT-6.1 INT-3A02 2.8.1(0.312/5/3) 2017-07-03 14:11 x86_64 Cygwin

Matrix products: default
LAPACK: /usr/lib/R/modules/lapack.dll

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] bit_1.1-12     compiler_3.4.0 bit64_0.9-7
Uwe
Connor Harris
  • On my machine (Ubuntu 16.04 with `R version 3.4.1 (2017-06-30) -- "Single Candle"`) it also works with the first option. – Garini Jul 12 '17 at 17:47
  • I'm not surprised; I've had Cygwin-only problems with R strings before (https://stackoverflow.com/questions/44187906/merging-large-data-tables-on-character-columns-causes-segfault). – Connor Harris Jul 12 '17 at 17:49

1 Answer

The duplicated() function may provide a workaround. On my system (Ubuntu Linux, kernel 3.13.0-121-generic), dt[!duplicated(dt), ] returns the same result as unique(dt) in both cases:

library(data.table)
dt <- data.table(x=factor(c('a', 'a', 'b', 'c')), y=c(1, 1, 2, 2))
all.equal(unique(dt), dt[!duplicated(dt), ])
[1] TRUE

dt <- data.table(x=c('a', 'a', 'b', 'c'), y=c(1, 1, 2, 2))
all.equal(unique(dt), dt[!duplicated(dt), ])
[1] TRUE
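Since all.equal() only compares the two results with each other, it cannot tell whether the workaround actually keeps the strings on an affected system. A minimal sketch for checking that, assuming the same example table as in the question (the name res is just illustrative), is to inspect the column directly rather than relying on the printed table:

```r
library(data.table)

# Same example table as in the question (character column, no factors).
dt <- data.table(x = c("a", "a", "b", "c"), y = c(1, 1, 2, 2))

# Workaround from above: keep only rows not flagged as duplicates.
res <- dt[!duplicated(dt), ]

# Look at the column itself instead of the printed table.
print(class(res$x))  # storage type of the column
print(nchar(res$x))  # lengths of zero would mean the strings are really gone
print(res$x)         # the raw values
```

If nchar() reports non-zero lengths while the printed table still shows blanks, the problem is more likely in printing than in unique() itself. I could not test this on Cygwin.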

Related post: Finding ALL duplicate rows, including "elements with smaller subscripts"

Damian