
Do I need to use copy() inside a function to avoid undesired modification of the input data.table?

For example

myfun <- function(mydata) {   
     mydata[,newcolumn := .N,by=id]   
     setnames(mydata, "newcolumn", "Count")
     return(table(mydata$Count))
}

or

myfun <- function(mydata) {   
     temp <- copy(mydata)
     temp[,newcolumn := .N,by=id]   
     setnames(temp, "newcolumn", "Count")
     return(table(temp$Count))
}

Or does passing the data.table to the function already create a copy, even if I assign columns with :=?

skan
  • Maybe related: [Understanding exactly when a data.table is a reference to (vs a copy of) another data.table](https://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another) – Henrik Feb 26 '18 at 19:04
  • Re the last question, no it does not create a copy on its own. I think they want to export the `shallow` function eventually that will make this copy less wasteful https://github.com/Rdatatable/data.table/issues/2323 Also relevant https://stackoverflow.com/a/45925735/ – Frank Feb 26 '18 at 19:04
  • In short, with **copy(mydata)** the original table stays unaffected. If that's what you want, then it's advised to copy the data table to another. So, in second function, the **newColumn** gets created in **temp** while the **mydata** table remains unaffected. – YOLO Feb 26 '18 at 19:07
  • @ManishSaraswat but does the first function affect the original data.table even if it's inside a function? I've been trying and it doesn't seem to, but I'm afraid it could produce unexpected results – skan Feb 26 '18 at 19:28
  • @skan no, it won't affect in the first function as well. I forgot to notice, since it's inside the function, it won't affect the data.table globally. or does it? – YOLO Feb 26 '18 at 19:52
  • @ManishSaraswat even if it uses := assignation? – skan Feb 26 '18 at 19:54
  • Sure, it will affect it except in special cases, like perhaps immediately after loading a data.table from disk https://stackoverflow.com/a/25558645/ – Frank Feb 26 '18 at 20:01
  • @skan my bad, I just tried to see what happens and the original table also got updated. In your function, may be you want to rename 'newcolumn' to 'Count'. That's an error I guess. – YOLO Feb 26 '18 at 20:04
  • @ManishSaraswat oh, yes, I'm sorry, this comes from a larger code. I think now is OK. At the end. Should we use copy() inside functions or not? – skan Feb 26 '18 at 20:31
  • @skan if you don't want to update your existing table, then use copy(). If you don't care if the existing table is also getting updated, then don't use copy(). Although, copying to a new table consumes more memory, but that's the tradeoff you have to consider. – YOLO Feb 26 '18 at 20:34

1 Answer


The answer linked by @Henrik (https://stackoverflow.com/a/10226454/4468078) explains all the details needed to answer your question.

This (modified) version of your example function does not modify the passed data.table:

library(data.table)
dt <- data.table(id = 1:4, a = LETTERS[1:4])
myfun2 <- function(mydata) {   
  x <- mydata[, .(newcolumn = .N), by=id]
  setnames(x, "newcolumn", "Count")
  return(table(x$Count))
}
myfun2(dt)

This does not copy the whole data.table (which would be a waste of RAM and CPU time) but only writes the result of the aggregation into a new data.table which you can modify without side effects (= no changes of the original data.table).

> str(dt)
Classes ‘data.table’ and 'data.frame':  4 obs. of  2 variables:
 $ id: int  1 2 3 4
 $ a : chr  "A" "B" "C" "D"

A data.table is always passed by reference to a function, so passing it does not create a copy. You have to be careful not to modify it inside the function unless you are absolutely sure you want the caller's table to change.
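
To see this concretely, here is a minimal sketch that runs both functions from the question on small made-up tables. The names `myfun_copy`, `dt1` and `dt2` are only illustrative and do not come from the question.

```r
library(data.table)

# First function from the question: := and setnames() act on the caller's table
myfun <- function(mydata) {
  mydata[, newcolumn := .N, by = id]
  setnames(mydata, "newcolumn", "Count")
  table(mydata$Count)
}

# Second function from the question (renamed here): works on a deep copy
myfun_copy <- function(mydata) {
  temp <- copy(mydata)
  temp[, newcolumn := .N, by = id]
  setnames(temp, "newcolumn", "Count")
  table(temp$Count)
}

dt1 <- data.table(id = c(1L, 1L, 2L), a = c("A", "B", "C"))
dt2 <- copy(dt1)

res1 <- myfun(dt1)
res2 <- myfun_copy(dt2)

names(dt1)  # now also contains "Count"
names(dt2)  # still only "id" and "a"
```

Both calls return the same table of counts; only the side effect on the input differs.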

The data.table package was designed for exactly this: modifying data in place, without R's usual "COW" ("copy on (first) write") behaviour, to keep data manipulation fast and memory-efficient.
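
As a rough illustration of the difference, the sketch below applies the same kind of column assignment to a plain data.frame and to a data.table from inside a function. The helper names `add_col_df` and `add_col_dt` and the column `flag` are hypothetical.

```r
library(data.table)

# Base R: assigning into the argument modifies a local copy only
add_col_df <- function(d) {
  d$flag <- TRUE
  invisible(NULL)
}

# data.table: := updates the object the caller passed in
add_col_dt <- function(d) {
  d[, flag := TRUE]
  invisible(NULL)
}

plain_df <- data.frame(id = 1:3)
plain_dt <- data.table(id = 1:3)

add_col_df(plain_df)
add_col_dt(plain_dt)

names(plain_df)  # the data.frame has no "flag" column
names(plain_dt)  # the data.table gained "flag"
```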

"Dangerous" operations that modify a data.table are mainly:

  • := assignment to modify or create a new column "in-place"
  • all set* functions (e.g. setnames(), setkey(), setorder(), set())
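
A small sketch of a few set* functions acting by reference. It uses a plain `<-` assignment instead of a function call to create a second name for the same object; `dt_demo` and `alias` are made-up names.

```r
library(data.table)

dt_demo <- data.table(id = c(3L, 1L, 2L), a = c("C", "A", "B"))
alias <- dt_demo   # plain assignment: no copy, both names refer to one object

setnames(alias, "a", "letter")                 # rename a column by reference
setorder(alias, id)                            # reorder the rows by reference
set(alias, i = 1L, j = "letter", value = "Z")  # overwrite one cell by reference

dt_demo  # all three changes are visible under the original name
```

Replacing `alias <- dt_demo` with `alias <- copy(dt_demo)` would leave `dt_demo` untouched.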

If you don't want to modify a data.table, restrict yourself to row filters and column (selection) expressions, i.e. the i, j, by etc. arguments without := in j.
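
For instance, each of the following queries returns a new data.table, and changing that result afterwards does not touch the source. This is a sketch on made-up data; `dt_ro` and the result names are illustrative.

```r
library(data.table)

dt_ro <- data.table(id = c(1L, 1L, 2L), a = c("A", "B", "C"))

rows   <- dt_ro[id < 2L]                    # row filter
cols   <- dt_ro[, .(id)]                    # column selection
counts <- dt_ro[, .(Count = .N), by = id]   # aggregation

# Modifying a result by reference only changes that result
counts[, Count := Count * 10L]

names(dt_ro)  # still "id" and "a"
```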

Chaining also prevents modification of the original data.table, provided the "by ref" modification happens in the second (or a later) link of the chain, after a step that has already produced a new data.table:

myfun3 <- function(mydata) {
  # the row filter in the first link returns a new data.table; := then modifies only that result
  return(mydata[id < 3,][, a := "not overwritten outside"])
}

myfun3(dt)
# > str(dt)
# Classes ‘data.table’ and 'data.frame':    4 obs. of  2 variables:
# $ id: int  1 2 3 4
# $ a : chr  "A" "B" "C" "D"
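
The order of the links matters. The sketch below contrasts a chain that filters first with one that assigns first; `dt_chain`, `safe` and `unsafe` are made-up names.

```r
library(data.table)

dt_chain <- data.table(id = 1:4, a = LETTERS[1:4])

# Filter first: := acts on the filtered result, not on dt_chain
safe <- dt_chain[id < 3][, a := "changed"]
unchanged_after_filter_first <- identical(dt_chain$a, LETTERS[1:4])

# := first: dt_chain itself is updated before the filter is applied
unsafe <- dt_chain[, a := "changed"][id < 3]
changed_after_assign_first <- all(dt_chain$a == "changed")
```

Here `unchanged_after_filter_first` and `changed_after_assign_first` should both be TRUE, i.e. only the second chain alters `dt_chain`.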
R Yoda