
This has been bothering me for a while, and I think there's something I just don't understand about the data.table package. If I take two "slices" of my data.table and assign one to a new name, then forget to call `copy`, data.table treats the two differently-named objects as the same thing.

  1. Is the only systematic way to avoid this to use the `copy` function? That is, if I don't use `copy`, will the two names always refer to the same underlying object?
  2. What's the purpose of this behavior? Something to do with memory? It seems like it can cause serious inadvertent errors if I modify one DT and then use the original object. Also, a new data.table user coming from base R who doesn't know about this behavior will have systematic problems throughout their code.
  3. What's the point of the `setDT` function if it doesn't actually "set" the data into a new object?

Here's an illustrative example:

library(data.table)
#####BOTH SETDT & COPY#####
first_dt <- data.frame(a = c(1,2,3), b = c(9,8,7))
setDT(first_dt)
second_dt <- copy(first_dt)
setDT(second_dt)

first_dt[,a:=a/50]
second_dt[,b:=b/50]
print(first_dt)
#      a b
#1: 0.02 9
#2: 0.04 8
#3: 0.06 7
print(second_dt)
#   a    b
#1: 1 0.18
#2: 2 0.16
#3: 3 0.14

#####BOTH SETDT#####
first_dt <- data.frame(a = c(1,2,3), b = c(9,8,7))
setDT(first_dt)
second_dt <- first_dt
setDT(second_dt)

first_dt[,a:=a/50]
second_dt[,b:=b/50]
print(first_dt)
#      a    b
#1: 0.02 0.18
#2: 0.04 0.16
#3: 0.06 0.14
print(second_dt)
#      a    b
#1: 0.02 0.18
#2: 0.04 0.16
#3: 0.06 0.14

#####SINGLE SETDT#####
first_dt <- data.frame(a = c(1,2,3), b = c(9,8,7))
setDT(first_dt)
second_dt <- first_dt

first_dt[,a:=a/50]
second_dt[,b:=b/50]
print(first_dt)
#      a    b
#1: 0.02 0.18
#2: 0.04 0.16
#3: 0.06 0.14
print(second_dt)
#      a    b
#1: 0.02 0.18
#2: 0.04 0.16
#3: 0.06 0.14

#####AS.DATA.TABLE#####
first_dt <- as.data.table(data.frame(a = c(1,2,3), b = c(9,8,7)))
second_dt <- first_dt

first_dt[,a:=a/50]
second_dt[,b:=b/50]
print(first_dt)
#      a    b
#1: 0.02 0.18
#2: 0.04 0.16
#3: 0.06 0.14
print(second_dt)
#      a    b
#1: 0.02 0.18
#2: 0.04 0.16
#3: 0.06 0.14

#####AS.DATA.TABLE WITH JUST COPY#####
first_dt <- as.data.table(data.frame(a = c(1,2,3), b = c(9,8,7)))
second_dt <- copy(first_dt)

first_dt[,a:=a/50]
second_dt[,b:=b/50]
print(first_dt)
#      a b
#1: 0.02 9
#2: 0.04 8
#3: 0.06 7
print(second_dt)
#   a    b
#1: 1 0.18
#2: 2 0.16
#3: 3 0.14

#####ANOTHER AS.DATA.TABLE#####
first_dt <- data.frame(a = c(1,2,3), b = c(9,8,7))
second_dt <- as.data.table(first_dt)
first_dt <- as.data.table(first_dt)
first_dt[,a:=a/50]
second_dt[,b:=b/50]
print(first_dt)
#      a b
#1: 0.02 9
#2: 0.04 8
#3: 0.06 7
print(second_dt)
#   a    b
#1: 1 0.18
#2: 2 0.16
#3: 3 0.14
Daycent
    See `?setDT`; the function is for users who want to use the `data.table` interface on a data.frame but may have limited memory. It is about memory and speed: a copy doubles RAM usage relative to the original data.frame, and skipping the allocation is faster because memory allocations take time. It sounds like you want `as.data.table`, as it includes the copy. These calls are largely equivalent: `setDT(copy(DF)); as.data.table(DF)` – Cole Dec 06 '20 at 04:36
  • Ah, I didn't know that! Thank you, Cole! – Daycent Dec 06 '20 at 06:32

1 Answer


Yes, it's intentional and very much by design. We hope new users will check out our vignettes (e.g. 1, 2, 3) to get an idea of why.

data.table is designed with large data sets in mind (e.g. 1GB, 10GB, or 50GB). With data that size, wasting memory can be the difference between an analysis working and being impossible. You can see the impact in this benchmark -- several alternatives to data.table simply fail to complete tasks on a 50GB data set, even though the machine has plenty of memory (128GB).

The reference semantics you observe are a necessary tradeoff to achieve this without spilling to disk or parallelizing across machines.
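A minimal sketch of what those reference semantics mean in practice, assuming data.table is installed (`address` is exported by data.table):

```r
library(data.table)

dt <- data.table(a = 1:3)
alias <- dt           # plain assignment: no data are copied;
                      # both names point at the same object in memory
address(dt) == address(alias)  # TRUE: same memory address

alias[, a := a * 2]   # := modifies that one object in place ...
dt$a                  # ... so dt sees the change too: 2 4 6
```

No allocation happens at any step here, which is exactly what makes `:=` cheap on multi-gigabyte tables.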

My recommendation is to be aware of this behavior and let it make you more deliberate in your analysis -- do you really need a new table with the same number of rows? Can what you want be achieved another way?

copy and as.data.table are always available as a workaround when needed, but I think using data.table successfully entails a slight change in approach.
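To make the workaround concrete, here is a small sketch (assuming data.table is installed) showing that both `as.data.table(DF)` and `setDT(copy(DF))` yield objects that are independent of the original:

```r
library(data.table)

df <- data.frame(a = c(1, 2, 3))

dt1 <- as.data.table(df)  # copies the data while converting
dt2 <- setDT(copy(df))    # copy() first, then convert in place

dt1[, a := a / 50]
dt2[, a := a * 100]

df$a   # still 1 2 3: neither update touched the original data.frame
```

Both routes cost one copy up front; after that, all `:=` updates stay local to each object.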

PS If there's anything arcane or hard to understand in the vignettes, or you otherwise have feedback, we would love to hear it -- feel free to file an Issue.

MichaelChirico
  • Thank you, Michael! I had not seen the vignettes, and had been figuring data.table out by trial and error. This clears things up. – Daycent Dec 06 '20 at 06:34
  • Hey Michael, can I ask: for functions that don't take data.tables as input, should I always end them with `copy` if I have to use them on multiple objects? `clean_ilostat <- function(id) { # id is a string variable; DT <- label_ilostat(get_ilostat(id)) # getting data from API; # [I do some things to clean DT using data.table syntax]; dr <- copy(DT); DT <- NULL; dr }` – Daycent Dec 07 '20 at 16:14
  • At a glance, it looks like `DT` is a new object, in which case a `copy` wouldn't be necessary. I'm not sure I can offer perfectly general advice; maybe this question can help? https://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another – MichaelChirico Dec 08 '20 at 02:55
  • This was the reason that made me convert from python `pandas` to R `data.table`. It simply works, and is a lot faster. – Matthew Son Dec 08 '20 at 16:03
  • Really? I've actually been thinking of switching to Python, though I haven't used it before. Seems like a lot of the new AI packages are being developed in Python or Julia. – Daycent Dec 08 '20 at 23:44