The answer linked by @Henrik (https://stackoverflow.com/a/10226454/4468078) explains all the details needed to answer your question.
This (modified) version of your example function does not modify the passed data.table:
library(data.table)
dt <- data.table(id = 1:4, a = LETTERS[1:4])
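myfun2 <- function(mydata) {
  # Aggregating with j/by writes the result into a new data.table
  x <- mydata[, .(newcolumn = .N), by = id]
  # setnames() works by reference, but only on the new object x
  setnames(x, "newcolumn", "Count")
  return(table(x$Count))
}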
myfun2(dt)
This does not copy the whole data.table (which would be a waste of RAM and CPU time). It only writes the result of the aggregation into a new data.table, which you can modify without side effects (i.e. without changing the original data.table).
> str(dt)
Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
$ id: int 1 2 3 4
$ a : chr "A" "B" "C" "D"
A data.table is always passed by reference to a function, so you have to be careful not to modify it unless you are absolutely sure you want to.
The data.table package was designed exactly for this: it modifies data in place, bypassing R's usual "COW" ("copy on (first) write") principle, to keep data manipulation efficient.
"Dangerous" operations that modify a data.table are mainly:
- := assignment to modify or create a new column "in-place"
- all set* functions
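To make the risk concrete, here is a minimal sketch of a function that uses both kinds of "dangerous" operations. The names add_flag, dt_demo and dt_safe are hypothetical and only serve the illustration. The second half uses copy() from the data.table package, which this answer does not otherwise discuss, as one possible way to protect the caller's object when in-place modification is not wanted.

```r
library(data.table)

dt_demo <- data.table(id = 1:4, a = LETTERS[1:4])

# Hypothetical function: both operations act on the caller's object
add_flag <- function(mydata) {
  mydata[, flag := TRUE]           # := adds a column in place
  setnames(mydata, "a", "letter")  # set* function renames in place
  invisible(mydata)
}

add_flag(dt_demo)
names(dt_demo)   # dt_demo now has the columns id, letter, flag

# Passing an explicit deep copy leaves the original untouched
dt_safe <- data.table(id = 1:4, a = LETTERS[1:4])
result <- add_flag(copy(dt_safe))
names(dt_safe)   # still id, a
```

Note that copy() duplicates the whole table, so it trades the memory savings described above for safety.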
If you don't want to modify a data.table, you can use just row filters and column (selection) expressions (the i, j, by etc. arguments).
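As a rough illustration of such read-only usage, the sketch below filters, selects and aggregates inside a function without touching the input. The names summarise_only and dt_ro are made up for this example.

```r
library(data.table)

dt_ro <- data.table(id = 1:4, a = LETTERS[1:4])

# Hypothetical read-only function: no := and no set* calls
summarise_only <- function(mydata) {
  rows   <- mydata[id > 2]                        # i: row filter
  cols   <- mydata[, .(id, lower = tolower(a))]   # j: column expressions
  counts <- mydata[, .N, by = a]                  # by: aggregation
  list(rows = rows, cols = cols, counts = counts)
}

res <- summarise_only(dt_ro)
names(dt_ro)   # still id, a
nrow(dt_ro)    # still 4
```

Each of the three expressions returns a new data.table, so the results in res can be modified later without affecting dt_ro.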
Chaining also prevents modification of the original data.table if you modify "by ref" only in the second (or a later) link of the chain, provided the first link (a filter, selection or aggregation) returns a new data.table:
myfun3 <- function(mydata) {
  # the row filter in the first [ ] returns a new data.table,
  # so := in the second [ ] only modifies that copy
  return(mydata[id < 3,][, a := "not overwritten outside"])
}
myfun3(dt)
# > str(dt)
# Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
# $ id: int 1 2 3 4
# $ a : chr "A" "B" "C" "D"