When removing columns, I have always done something like:

DT[, Tax:=NULL]

Sometimes, to make a backup, I would do something like:

DT2 <- DT

But just a second ago this happened:

library(data.table)
DT <- structure(list(Province = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 
3), Tax = c(2000, 3000, 1500, 3200, 2000, 1500, 4000, 2000, 2000, 
1000, 2000, 1500), year = c(2000, 2000, 2000, 2001, 2001, 2001, 
2002, 2002, 2002, 2003, 2003, 2003)), row.names = c(NA, -12L), class = c("tbl_df", 
"tbl", "data.frame"))

DT2 <- structure(list(Province = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 
3), Tax = c(2000, 3000, 1500, 3200, 2000, 1500, 4000, 2000, 2000, 
1000, 2000, 1500), year = c(2000, 2000, 2000, 2001, 2001, 2001, 
2002, 2002, 2002, 2003, 2003, 2003)), row.names = c(NA, -12L), class = c("tbl_df", 
"tbl", "data.frame"))

setDT(DT) 
setDT(DT2)
DT2 <- DT

# Removes Tax in BOTH datasets !!
DT2[, Tax:=NULL]

I remember something about this when starting to learn about data.table, but obviously this is not really desirable (for me at least).

What is the proper way to deal with this without accidentally deleting columns?

Tom
  • Since `data.table` uses referential semantics (*in-place*, not copy-on-write like most of R), then your assignment `DT2 <- DT` means that both variables point to the same data. This is one of the gotchas with "memory-efficient operations" that rely on in-place work: if you goof, you lose it. Any way that will protect you against this kind of mistake will be memory-inefficient, keeping one (or more) copies of data sitting around. – r2evans Dec 03 '20 at 14:01
  • If you need `DT2` to be a different dataset, then use `DT2 <- copy(DT)`, after which `DT2[,Tax:=NULL]` will not affect `DT`. – r2evans Dec 03 '20 at 14:02
  • You can find your answer here: https://stackoverflow.com/a/10226454/3768871 – OmG Dec 03 '20 at 14:03
  • Thank you very much! – Tom Dec 03 '20 at 16:44
  • You can also read the FAQ. @r2evans please make an answer from your comment. – jangorecki Dec 03 '20 at 18:30

1 Answer

(Moved from comments.)

Since data.table uses reference semantics (operations such as := modify the object in place, instead of the copy-on-modify behavior of most of R), your assignment DT2 <- DT means that both variables point to the same data, and a change made through one name shows up under the other. This is one of the gotchas with "memory-efficient operations" that rely on in-place work: if you goof, you lose it. Any approach that protects you against this kind of mistake will be memory-inefficient, since it keeps one (or more) copies of the data sitting around.
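As a small illustration of the shared-data behavior described above, here is a sketch using a made-up three-row table (not the original data). It relies on data.table's address() helper to show that both names refer to the same object.

```r
library(data.table)

# Toy data for illustration only
DT <- data.table(Province = c(1, 2, 3), Tax = c(2000, 3000, 1500))

# Plain assignment: DT2 is just a second name for the same object
DT2 <- DT

# address() reports where an object lives in memory
same_object <- identical(address(DT), address(DT2))

# := modifies in place, so the column disappears under both names
DT2[, Tax := NULL]
cols_after <- names(DT)
```

Here same_object is TRUE, and cols_after no longer contains Tax even though only DT2 was modified.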

If you need DT2 to be a different dataset, then use

DT2 <- copy(DT)

after which DT2[,Tax:=NULL] will not affect DT.
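A minimal sketch of the copy() approach, again on a made-up three-row table instead of the original data, to confirm that the two objects are independent afterwards:

```r
library(data.table)

# Toy data for illustration only
DT <- data.table(Province = c(1, 2, 3), Tax = c(2000, 3000, 1500))

# copy() makes a deep copy, so DT2 has its own memory
DT2 <- copy(DT)
different_object <- !identical(address(DT), address(DT2))

# Removing the column by reference now only touches DT2
DT2[, Tax := NULL]

cols_original <- names(DT)   # DT keeps Tax
cols_backup   <- names(DT2)  # DT2 has lost it
```

If the goal is only a version of the table without Tax, DT[, !"Tax"] should also work: it returns a new data.table and leaves DT untouched, at the cost of the same extra memory as a copy.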

I find Matt Dowle's answer here to be informative/helpful (though that question explicitly asked about copy, not just the behavior you mentioned).

r2evans