library(data.table)

DT = data.table(iris)

The iris data as a data.table

str(DT)
> Classes ‘data.table’ and 'data.frame':  150 obs. of  5 variables:
>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... 
>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... 
>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... 
>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1
>  - attr(*, ".internal.selfref")=<externalptr>

This is just a simple function to add up numeric parts of iris by removing the factor column.

myfun = function(dt){
    dt[,Species:=NULL]
    return(sum(dt))
}

Run the function

myfun(DT)  
> [1] 2078.7

Now DT is missing the Species column in the global environment

str(DT)
> Classes ‘data.table’ and 'data.frame':  150 obs. of  4 variables:
>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
>  - attr(*, ".internal.selfref")=<externalptr>
geneorama
  • data tables use call-by-reference semantics: see http://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another – Ben Bolker Oct 08 '13 at 18:29
  • Use `copy(dt)[,Species:=NULL]` to make sure changes are not applied globally. There are advantages and disadvantages to calling by reference, this is a disadvantage. – Señor O Oct 08 '13 at 18:44
  • I'm aware of the pass by reference nature of `data.table`, and of `data.table::copy`, and of the various methods for testing for numeric columns and specifying columns. I was not aware that a modification to a local variable within a function could ever affect the global variable for any R object. – geneorama Oct 08 '13 at 22:21

2 Answers


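data.table works by reference: `:=` and the `set*` functions modify a table in place instead of copying it. This is what makes it so fast and useful.

But this also means you have to be careful when passing a data.table to a function. If the function modifies its argument in place and you did not pass a copy, you will alter the original object.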

myfun = function(dt){
    # Use something like this
    dt <- copy(dt)    # KEY LINE: work on a copy so the caller's table is untouched
    dt[,Species:=NULL]
    return(sum(dt))
}

Alternatively, you could call `copy` at the call site, like so:

 myfun(copy(DT))

But I think that leaves too much room for mistakes.
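To check that the copy really protects the caller's table, here is a minimal sketch. The name `myfun_safe` is made up for illustration; it is the same function as above with the `copy()` line included.

```r
library(data.table)

DT <- data.table(iris)

# Hypothetical name: myfun with the defensive copy() added
myfun_safe <- function(dt) {
    dt <- copy(dt)          # local copy; `dt` no longer refers to the caller's table
    dt[, Species := NULL]   # removes the column from the local copy only
    sum(dt)
}

total <- myfun_safe(DT)

# The caller's table should still have all five columns
has_species <- "Species" %in% names(DT)
print(has_species)
print(ncol(DT))
```

The cost is one full copy of the table per call, so for very large tables you may prefer the column-selection approach and avoid `:=` inside the function altogether.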

Ricardo Saporta
  • The only place that the word "reference" appears in any data.table documentation is in FAQ 2.21, which describes the behavior of `:=` when used in the global scope. From this documentation and because of the way R's scoping rules work in functions, I wouldn't expect that you need to make a copy of a data table within a function to use it normally. – geneorama Oct 10 '13 at 00:17
  • @geneorama So? Now you know; no one is criticizing you for not knowing before. By the way, the FAQ is not the only documentation for the package; the phrase "by reference" shows up 7 times in `?data.table`, for example. – Frank Oct 10 '13 at 00:28
  • @geneorama, In the data.table documentation, available [HERE on CRAN](http://cran.r-project.org/web/packages/data.table/data.table.pdf), the very first words after the table of contents are "`:= Assignment by reference`". It appears another `40` times after that. – Ricardo Saporta Oct 10 '13 at 02:03
  • You're right, I was looking at the FAQ and the Intro, I forgot about the reference manual. My reading of the "by reference" documentation refers to data.table internals and describes how `:=` is used to make new columns. I still don't see that it warns the user that functions within R become pass by reference for data.tables instead of pass by value as is normally the case. – geneorama Oct 10 '13 at 05:43
  • I changed the title to a question so that it makes sense to accept this as an answer. I guess the real answer is "yep, that's what it's supposed to do, make copies if you don't want it to do that". – geneorama Oct 10 '13 at 05:49
  • @geneorama, the confusion makes sense. The great part is that once you get the hang of it, you can start doing incredible things with this! A helpful tip: I use a different naming convention for functions which I expect to modify in place. This helps me when reading my own code to understand what's going on – Ricardo Saporta Oct 10 '13 at 13:25
  • Just for the record, I love `data.table`. For me it's probably the most valuable thing in R. I presented on it to the local R group last year. See "Data Tables: An introduction (and pitch)" at http://datatable.r-forge.r-project.org/ – geneorama Oct 10 '13 at 19:18
  • Why does this not happen if you do a join inside the function? `DT_1[DT_2, on = "x", ("y") := get("y")]` In that situation the variable y is not added to the data.table – Frish Vanmol May 06 '21 at 20:09

It's a duplicate, found by searching for: [r] select columns data.table

Any of these work:

> sum(DT[,!"Species"])
[1] 2078.7
> sum(DT[,1:4])
[1] 2078.7
> sum(DT[,-5])
[1] 2078.7

'Species' is still in DT.
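If you would rather not hard-code the column name or positions, a variation on the same idea is to pick the numeric columns programmatically. This is only a sketch; `num_cols` is a name introduced here for illustration.

```r
library(data.table)

DT <- data.table(iris)

# Names of the numeric columns (illustrative variable name)
num_cols <- names(DT)[vapply(DT, is.numeric, logical(1))]

# Selecting columns returns a new table, so DT itself is not modified
total <- sum(DT[, num_cols, with = FALSE])
print(total)

print(names(DT))  # Species should still be listed
```

Nothing here uses `:=` or a `set*` function, so there is no in-place modification to worry about.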

MichaelChirico
IRTFM
  • If we change the title to "Editing data.table columns in a local environment affects the global environment", then we can point people here when this complaint comes up again. The OP's problem/goal is a dupe, but not the diagnosis (violation of scoping/editing by reference), I guess... By the way, here's another flavor: `sum(\`[.listof\`(DT,sapply(DT,is.numeric)))` – Frank Oct 08 '13 at 20:10
  • How about: "Assignment using := within data.table is permanent." – IRTFM Oct 08 '13 at 20:30
  • Yeah, that's good. It should show up in the right searches (made by folks encountering this data.table feature for the first time)...or maybe the same title with "...is global" (though that may be slightly inaccurate)? Also, I guess it also extends to other in-place modifications (like `setkey`, `setcolorder`, `set`) besides `:=`... – Frank Oct 08 '13 at 20:36
  • Do you think that "data.table violates scope rules within function calls" would be rude? I think this is an issue and a bug. If it's a feature, then it should be more clearly documented. I've read the documentation many times and never noticed this. Normally I would use `get` with `envir = .GlobalEnv` within a function to access global variables (but I normally avoid doing that). – geneorama Oct 09 '13 at 20:11
  • `data.table` documentation is very clear that it is not using either normal R semantics or R syntax for expressions in the 'j' position. I think this particular behavior is just a consequence of the modify-in-place design of data.table-objects and functions. – IRTFM Oct 09 '13 at 20:25
  • Your comment has nothing to do with scope or environment, which is my question. – geneorama Oct 10 '13 at 00:11
  • The `:=` function should be thought of as combining non-standard evaluation with the semantics of `<<-` where the environment is the enclosing data.frame object. If you think this is a bug, then you should just abandon the use of `data.table`. You will never be happy using it. – IRTFM Oct 10 '13 at 00:42
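As Frank's comment above points out, the same thing applies to the other in-place functions, not just `:=`. A small sketch with `setnames` and `setcolorder`, using a made-up table and a hypothetical helper called `tidy_up`:

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6)

# Hypothetical helper: no `:=` anywhere, but both calls modify in place
tidy_up <- function(dt) {
    setnames(dt, "a", "id")          # renames the column in the caller's table
    setcolorder(dt, c("b", "id"))    # reorders the columns in the caller's table
    invisible(dt)
}

tidy_up(DT)
print(names(DT))  # the global DT now has the new names and order
```

If the helper should leave its input alone, start it with `dt <- copy(dt)` as in the accepted answer.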