I am writing a function to select values from data.table variables depending on conditions. The problem is that when the variable name does not match the name of the function argument, it is not selected correctly. The code is below.

library(data.table)

dt <- data.table(x = c(1, 2, 3, NA, NA), 
                 y = c(2, 4, 3, 5, NA))

dd <- data.table(p = c(1, 2, 3, NA, NA), 
                 q = c(2, 4, 3, 5, NA))

is.data.table(dt)
is.data.table(dd)


variable_chooser <- function(dt, x , y ) {

  dt[!is.na(x), z := x]
  dt[is.na(x) & !is.na(y), z := y]
  dt[is.na (x) & is.na(y), z := NA]

}

variable_chooser(dt, dt$x, dt$y)
variable_chooser(dd, dd$p, dd$q)

dt
dd

The 2 data sets look like this at the end.

> dt
    x  y  z
1:  1  2  1
2:  2  4  2
3:  3  3  3
4: NA  5  5
5: NA NA NA


> dd
    p  q  z
1:  1  2  1
2:  2  4  2
3:  3  3  3
4: NA  5  2
5: NA NA NA

The dd dataset has the value of the fourth row of the z variable taken from the first row of q rather than the fourth. With dt, the code works as I expected. How do I make the code for dd work in the same way?

Thank you.

9314197

1 Answer

Referencing columns through variables is not as simple as it could be, but with data.table it is still best to work with the column names themselves. For how to use column names stored in variables, see the question "Referring to data.table columns by names saved in variables".
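To illustrate why the names matter, here is a small sketch of how data.table resolves a name inside [ ]: it looks among the columns first and only falls back to the calling environment when no column has that name. The objects x and p defined outside the table are made up for this illustration and are not part of the question.

```r
library(data.table)

dd <- data.table(p = c(1, 2, 3, NA, NA),
                 q = c(2, 4, 3, 5, NA))

# Hypothetical objects that live outside the table.
x <- c(10, 20)     # dd has no column called x
p <- "outside"     # dd does have a column called p

# No column named x, so the outside vector (length 2) is used.
len_x <- dd[, length(x)]

# A column named p exists, so the column wins over the outside string.
class_p <- dd[, class(p)]

len_x    # 2
class_p  # "numeric"
```

In the question's function, dt$x and dt$y happened to match the column names of dt, so the columns were used; for dd the full-length argument vectors were used instead, which is why the row alignment was lost.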

Here is an example with get. I've changed the function so that it takes the column names rather than the column values, and I used NA_real_ to create the z column of NAs first.

variable_chooser <- function(dt, xvar, yvar) {

  dt[, z := NA_real_]
  dt[!is.na(get(xvar)), z := get(xvar)]
  dt[is.na(get(xvar)) & !is.na(get(yvar)), z := get(yvar)]

  return(dt)
}

dt2 <- variable_chooser(dt=dt, xvar="x", yvar="y")
dd2 <- variable_chooser(dt=dd, xvar="p", yvar="q")

dt2[]
dd2[]

Because := modifies a data.table by reference, the function changes the object that is passed in. If you want to guarantee that the original objects (e.g. dt and dd) are not modified by the function, pass a copy instead, e.g. dt = copy(dd); this keeps the original intact.
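As a minimal sketch of the copy() approach, the function from the example above is repeated here so the snippet runs on its own:

```r
library(data.table)

# Same function as in the answer: takes column names as strings.
variable_chooser <- function(dt, xvar, yvar) {
  dt[, z := NA_real_]
  dt[!is.na(get(xvar)), z := get(xvar)]
  dt[is.na(get(xvar)) & !is.na(get(yvar)), z := get(yvar)]
  return(dt)
}

dd <- data.table(p = c(1, 2, 3, NA, NA),
                 q = c(2, 4, 3, 5, NA))

# Pass a copy so that := inside the function does not touch dd itself.
dd2 <- variable_chooser(dt = copy(dd), xvar = "p", yvar = "q")

names(dd)   # "p" "q" (no z column was added to the original)
dd2$z       # 1 2 3 5 NA
```

Without copy(), dd and dd2 would refer to the same table, and dd would gain the z column as well.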

Jonny Phelps
  • Thank you. That works. Can I ask why you add `NA_real_` first? It seems to work even without that. I actually prefer it with variable names rather than values. – 9314197 Nov 19 '18 at 15:02
  • I prefer to create the blank column structure first, then apply the rules. I think it's a bit more robust. The reason I used `NA_real_` is that the classes of `x` and `y` are numeric, and if you compare `class(NA_real_)` with `class(NA)`, you see that the default is logical. It also shortens the code, as you now have 2 rules rather than 3. See https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/NA for other classes – Jonny Phelps Nov 19 '18 at 15:10
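The class difference mentioned in the last comment can be checked directly in base R. The variable names below are only for illustration:

```r
# Plain NA is logical; the typed NA constants carry the class of the column
# they are meant for.
cls_plain <- class(NA)             # "logical"
cls_real  <- class(NA_real_)       # "numeric"
cls_int   <- class(NA_integer_)    # "integer"
cls_chr   <- class(NA_character_)  # "character"

c(cls_plain, cls_real, cls_int, cls_chr)
```

Creating z with NA_real_ therefore gives a numeric column from the start, matching the numeric values assigned to it afterwards.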