selecting variables data.table

Question

I am writing a function to select values from data.table variables depending on conditions. The problem is that when the variable name does not match the name of the function argument, it is not selected correctly. The code is below.

library(data.table)

dt <- data.table(x = c(1, 2, 3, NA, NA), 
                 y = c(2, 4, 3, 5, NA))

dd <- data.table(p = c(1, 2, 3, NA, NA), 
                 q = c(2, 4, 3, 5, NA))

is.data.table(dt)
is.data.table(dd)


variable_chooser <- function(dt, x , y ) {

  dt[!is.na(x), z := x]
  dt[is.na(x) & !is.na(y), z := y]
  dt[is.na (x) & is.na(y), z := NA]

}

variable_chooser(dt, dt$x, dt$y)
variable_chooser(dd, dd$p, dd$q)

dt
dd

The 2 data sets look like this at the end.

> dt
    x  y  z
1:  1  2  1
2:  2  4  2
3:  3  3  3
4: NA  5  5
5: NA NA NA


> dd
    p  q  z
1:  1  2  1
2:  2  4  2
3:  3  3  3
4: NA  5  2
5: NA NA NA

In dd, the fourth row of z takes its value from the first row of q rather than from the fourth row. With dt, the code works as I expected. How do I make the code work the same way for dd?

Thank you.

I am guessing if you did `variable_chooser <- function(myData, x , y ) {...` then dt would fail, too. — zx8754, Nov 19 '18 at 12:53
No, it works as expected if the dataset name is not the same. — 9314197, Nov 19 '18 at 13:24

score 0 · Accepted Answer · answered Nov 19 '18 at 13:10

Referring to columns through variables is not as simple as it could be, but with data.table it is still best to work with the column names themselves. For how to use custom variable names, see: Referring to data.table columns by names saved in variables

Here is an example with get. I've changed it so that it uses the column name rather than the column values. I used NA_real_ to set up the NA column first.

variable_chooser <- function(dt, xvar, yvar) {

  dt[, z := NA_real_]
  dt[!is.na(get(xvar)), z := get(xvar)]
  dt[is.na(get(xvar)) & !is.na(get(yvar)), z := get(yvar)]

  return(dt)
}

dt2 <- variable_chooser(dt=dt, xvar="x", yvar="y")
dd2 <- variable_chooser(dt=dd, xvar="p", yvar="q")

dt2[]
dd2[]
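For context, here is a minimal sketch of why passing the column values went wrong for dd in the question. It is my reading of the output shown there, not something confirmed in the thread: because dd has no columns named x or y, those names inside dd[...] resolve to the full-length vectors passed in, so the right-hand side of := has five values while the row filter selects only one row. The `rows` variable is introduced here purely for illustration. Judging by the output in the question, the data.table version in use took values from the start of the longer vector; other versions may warn or refuse the assignment instead.

```r
library(data.table)

dd <- data.table(p = c(1, 2, 3, NA, NA),
                 q = c(2, 4, 3, 5, NA))

# What the function receives when called as variable_chooser(dd, dd$p, dd$q)
x <- dd$p
y <- dd$q

# Rows selected by the second rule, is.na(x) & !is.na(y)
rows <- which(is.na(x) & !is.na(y))
rows          # 4

length(rows)  # 1 row to fill
length(y)     # 5 values supplied on the right-hand side

# Taking values from the start of y gives the wrong value;
# indexing y by the selected rows gives the intended one.
y[seq_along(rows)]  # 2, the first element of q (what ended up in z)
y[rows]             # 5, the intended value
```

With dt, the names x and y match real columns, so data.table uses the columns and subsets them to the selected rows, which is why that call behaved as expected.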

If you want to guarantee that the original objects (e.g. dt and dd) are not changed by the function, pass a copy instead, for example dt=copy(dd). The function then modifies the copy and the original stays intact.
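A short sketch of the copy() approach, reusing the function from this answer and the dd data from the question. It shows that the original keeps its two columns while the returned copy gains z.

```r
library(data.table)

# Same function as above: adds column z by reference
variable_chooser <- function(dt, xvar, yvar) {
  dt[, z := NA_real_]
  dt[!is.na(get(xvar)), z := get(xvar)]
  dt[is.na(get(xvar)) & !is.na(get(yvar)), z := get(yvar)]
  return(dt)
}

dd <- data.table(p = c(1, 2, 3, NA, NA),
                 q = c(2, 4, 3, 5, NA))

# Pass a copy so the function modifies the copy, not dd itself
dd2 <- variable_chooser(dt = copy(dd), xvar = "p", yvar = "q")

names(dd)   # "p" "q": the original is unchanged
names(dd2)  # "p" "q" "z"
dd2$z       # 1 2 3 5 NA
```

Without copy(), the := assignments inside the function would also add z to dd, because data.table updates by reference.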

Thank you. That works. Can I ask why you add `NA_real_` first? It seems to work even without that. I actually prefer it with variable names rather than values. — 9314197, Nov 19 '18 at 15:02
I prefer to create the blank column structure first, then apply the rules. I think it's a bit more robust. The reason I used `NA_real_` is that the classes of `x` and `y` are numeric, and if you compare `class(NA_real_)` with `class(NA)` you see that the default is logical. It also shortens the code, as you now have 2 rules rather than 3. See https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/NA for other classes — Jonny Phelps, Nov 19 '18 at 15:10
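To illustrate the point about NA types made in the comment above, a small check of the classes involved. The one-column table here is made up for the example.

```r
library(data.table)

class(NA)             # "logical"
class(NA_real_)       # "numeric"
class(NA_integer_)    # "integer"
class(NA_character_)  # "character"

# Creating the column with NA_real_ gives it a numeric type from the start
dt <- data.table(x = c(1, NA))
dt[, z := NA_real_]
class(dt$z)           # "numeric"
```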
