R - readRDS() & load() fail to give identical data.tables as the original

Background

I tried to replace some CSV output files with rds files to improve efficiency. These are intermediate files that will serve as inputs to other R scripts.

Question

I started investigating when my scripts failed and found that readRDS() and load() do not return identical data tables as the original. Is this supposed to happen? Or did I miss something?

Sample code

library( data.table )

aDT <- data.table( a=1:10, b=LETTERS[1:10] )
saveRDS( aDT, file = "aDT.rds")
bDT <- readRDS( file = "aDT.rds" )
identical( aDT, bDT, ignore.environment = TRUE )  # Gives FALSE

aDF <- data.frame( a=1:10, b=LETTERS[1:10] )
saveRDS( aDF, file = "aDF.rds")
bDF <- readRDS( file = "aDF.rds" )
identical( aDF, bDF, ignore.environment = TRUE )  # Gives TRUE

# Using 'save' & 'load' doesn't help either
aDT2 <- data.table( a=1:10, b=LETTERS[1:10] )
save( aDT2, file = "aDT2.RData")
bDT2 <- aDT2; rm( aDT2 )
load( file = "aDT2.RData" )
identical( aDT2, bDT2, ignore.environment = TRUE )  # Gives FALSE

I am running R ver 3.2.0 on Linux Mint and have tested with data.table ver 1.9.4 and 1.9.5 (latest).

Searching SO and Google returned this and this, but I don't think they answer this issue. I am still trying to figure out why my scripts failed when I switched to rds, but I am starting with this.

Would appreciate it very much if knowledgeable SO members can help. Thanks!

Edit:

Hi everyone, I happened to find a way to resolve the issue - have posted the solution below. I apologise if it's rather inelegant. Now, I have 2 further questions:

(1) Is there a better way?

(2) Can something be done in the R and/or data.table code to resolve this? This issue causes unpredictable bugs and is not the first thing that comes to mind. My 2 cents' worth.

Hmm... good point... I've only ever used `identical`. Going through ?`all.equal` shows that it's a test for 'near-equality', so perhaps the difference is in the pointers, as mentioned by the 2 gentlemen below? – NoviceProg, Jul 06 '15 at 16:53

score 4 · Answer 1 · user227710 · 2015-07-06T16:51:06.617

Probably, this has to do with pointers:

> attributes(aDT)
$names
[1] "a" "b"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.table" "data.frame"

$.internal.selfref
<pointer: 0x0000000000390788>

> attributes(bDT)
$names
[1] "a" "b"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.table" "data.frame"

$.internal.selfref
<pointer: (nil)>

> attributes(bDF)
$names
[1] "a" "b"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.frame"

> attributes(aDF)
$names
[1] "a" "b"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10

$class
[1] "data.frame"

You can look more closely at what is going on with the .Internal(inspect(.)) command:

.Internal(inspect(aDT))

.Internal(inspect(bDT))

edited Jul 06 '15 at 16:51

answered Jul 06 '15 at 16:42


Thanks for your reply, @user227710. Is there any way to re-establish the pointer for the `data table` that has been re-loaded, without access to the original DT? – NoviceProg Jul 06 '15 at 16:55

@NoviceProg: Your issue is discussed in more detail [here](http://r.789695.n4.nabble.com/What-is-going-on-with-R-3-1-td4689002.html). `saveRDS` doesn't save the `.internal.selfref`, so I think it's not possible. – user227710 Jul 06 '15 at 16:59
I went through the link you provided as well as the SO thread in the link. Interestingly, they mentioned the issue has been resolved in `data table` v1.9.3, not sure if that OP was facing a similar issue. – NoviceProg Jul 07 '15 at 13:52

score 3 · Answer 2 · answered Jul 06 '15 at 16:40

The newly loaded data.table doesn't know the pointer value of the one already in memory. You could set it with

attributes(bDT)$.internal.selfref <- attributes(aDT)$.internal.selfref
identical( aDT, bDT, ignore.environment = TRUE )
# [1] TRUE

data.frames don't carry this attribute, probably because they are not modified in place.
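
When the script that reads the file has no access to the original table, one option is to compare contents rather than the self-reference pointer. A minimal sketch, assuming a temporary file (`rds_path` is an illustrative name, not something from this thread):

```r
library(data.table)

aDT <- data.table(a = 1:10, b = LETTERS[1:10])

# rds_path is a hypothetical temporary location used only for this sketch
rds_path <- tempfile(fileext = ".rds")
saveRDS(aDT, rds_path)
bDT <- readRDS(rds_path)

# Compare column by column; this ignores table-level attributes
# such as .internal.selfref
same_columns <- all(mapply(identical, aDT, bDT))

# all.equal() tests near-equality of contents rather than exact identity
near_equal <- all.equal(aDT, bDT)
```

The column-wise check is strict about values and types but ignores the pointer, which is usually what matters for an intermediate file.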

Rorschach

31,301
5
78
129

Hi @LegalizeIt, I see where you're heading but what happens if the script loading bDT does not have access to aDT? That's the reason for the intermediate files (`csv`/`rds`). – NoviceProg Jul 06 '15 at 16:48

score 3 · Answer 3 · answered Oct 20 '17 at 03:18

The solution is to call setDT after load or readRDS:

bDT <- readRDS("aDT.rds")  # readRDS reads files written by saveRDS, not .RData files
setDT(bDT)

source: Adding new columns to a data.table by-reference within a function not always working
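
For scripts that read several intermediate files, the two steps can be wrapped in one function. This is only a sketch: `read_dt` is a hypothetical helper name, not part of data.table, and it assumes the file was written with `saveRDS`.

```r
library(data.table)

# Hypothetical helper: read an .rds file and, if it holds a data.frame or
# data.table, let setDT() restore the data.table internals
read_dt <- function(path) {
  x <- readRDS(path)
  if (is.data.frame(x)) setDT(x)
  x
}

# Usage with a temporary file
tmp_path <- tempfile(fileext = ".rds")
saveRDS(data.table(a = 1:10, b = LETTERS[1:10]), tmp_path)
bDT <- read_dt(tmp_path)
bDT[, c := a * 2L]  # add a column by reference to the restored table
```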

user3226167

3,131
2
30
34

score 1 · Answer 4 · answered Jul 07 '15 at 14:19

I happened to find a way that resolves the issue (disclaimer: it's rather inelegant, but it works!): adding and then deleting a dummy column in the loaded data table makes identical return TRUE. I have also successfully replaced csv with rds intermediate files in my own code.

To be honest, I don't understand enough of the inner workings of R or data table to know why it works, so any explanations and/or more elegant solutions would be welcome.

library( data.table )

aDT <- data.table( a=1:10, b=LETTERS[1:10] )
saveRDS( aDT, file = "aDT.rds")
bDT <- readRDS( file = "aDT.rds" )
identical( aDT, bDT, ignore.environment = TRUE )  # Gives FALSE

bDT[ , aaa := NA ]; bDT[ , aaa := NULL ]
identical( aDT, bDT, ignore.environment = TRUE )  # Now gives TRUE

# Using the add-del-col 'trick' works here too
aDT2 <- data.table( a=1:10, b=LETTERS[1:10] )
save( aDT2, file = "aDT2.RData")
bDT2 <- aDT2; rm( aDT2 )
load( file = "aDT2.RData" )
identical( aDT2, bDT2, ignore.environment = TRUE )  # Gives FALSE

aDT2[ , aaa := NA ]; aDT2[ , aaa := NULL ]
identical( aDT2, bDT2, ignore.environment = TRUE )  # Now gives TRUE
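
A possible explanation, offered as an assumption based on data.table's over-allocation behaviour rather than anything confirmed in this thread: a table read back from disk has lost its spare column slots and its self-reference, and `:=` re-allocates them before adding the dummy column. If that is right, `setalloccol()` (named `alloc.col()` in older data.table versions) should do the same re-allocation directly. A minimal sketch, where `rds_path` is a hypothetical temporary path:

```r
library(data.table)

aDT <- data.table(a = 1:10, b = LETTERS[1:10])
rds_path <- tempfile(fileext = ".rds")  # hypothetical path for this sketch
saveRDS(aDT, rds_path)
bDT <- readRDS(rds_path)

tl_before <- truelength(bDT)  # allocated column slots right after reading

# Re-allocate column slots directly, with no dummy column needed
setalloccol(bDT)

tl_after <- truelength(bDT)   # allocated column slots after re-allocation
```

Comparing `tl_before` and `tl_after` shows whether the table was re-allocated on a given installation.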
