SO #24833247 covers nearly all the use cases for passing column names dynamically to a data.table within a function. However, it misses one I'm currently trying to address: passing variables to the i expression.

I'm trying to refactor some data cleansing code into a function that converts certain values to NA after I've pulled the data into a data.table.

For example, given the following:

library(data.table)

dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
dt[colA %in% c('~', ''), colA := NA]
dt[colB %in% c('~', ''), colB := NA]

I want a generic function that replaces the '~', '?' and '' values with NA, instead of having to explicitly code each transformation.

dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
clearCol(dt, colA)
clearCol(dt, colB)

The j expression is straightforward:

clearCol <- function(dt, f) {
  f = substitute(f)
  dt[,(f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
   colA
1:   NA
2:   NA
3:   NA
4:   NA
5:   NA

However, extending it to add the variable to the i expression fails:

clearCol <- function(dt, f) {
  f = substitute(f)
  dt[(f) %in% c('~', ''),(f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
 Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments
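This failure can be reproduced in plain base R: after substitute(), f is bound to the symbol colA, so %in% hands a language object straight to match(). A minimal illustration of that (nothing beyond base R assumed):

f <- quote(colA)
f %in% c('~', '')
 Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments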

Swapping to this seems to work, but the lack of output with verbose = TRUE (compared to the hard-coded method at the top) leaves me concerned that it will not scale well on the large data sets I'm working with:

clearCol <- function(dt, f) {
  f = deparse(substitute(f))
  dt[get(f) %in% c('~', ''),(f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
   colA
1:    A
2:    b
3:   NA
4:    d
5:   NA

Is there another way of doing what I want?

1 Answer

You can follow data.table FAQ 1.6 to get the verbose output:

cc = function(d, col, vs = c("~", ""), verb = FALSE){
  col = substitute(col)          # capture the bare column name as a symbol
  ix  = substitute(col %in% vs)  # build the i expression, e.g. colA %in% c("~", "")
  d[eval(ix), as.character(col) := NA, verbose = verb ][]
}

dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
cc(dt, colA, verb = TRUE)

which gives

Creating new index 'colA'
Starting bmerge ...done in 0 secs
Detected that j uses these columns: <none> 
Assigning to 2 row subset of 5 rows
Dropping index 'colA' due to update on 'colA' (column 1)
   colA colB
1:    A     
2:    b    ?
3:   NA   a1
4:    d   a2
5:   NA   z4

However, notice what the verbose output is saying here. It's creating an index (assuming you didn't already create one, which seems likely since the data was only just read in)... and then it's dropping that index (since the update to the column invalidates it). That hardly sounds like something that contributes to efficiency.
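
You can watch the index come and go directly; a quick sketch (indices() is part of the data.table API):

dt <- data.table(colA = c('A', 'b', '~', 'd', ''))
dt[colA %in% c('~', '')]              # auto-indexing builds an index on colA
indices(dt)                           # "colA"
dt[colA %in% c('~', ''), colA := NA]
indices(dt)                           # NULL - the update dropped the index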

If you really want to do this efficiently, there are a couple of options (sketched below):

  1. Use na.strings when reading the data in
  2. Use set if you have a ton of columns and somehow can't do #1
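
For illustration, a sketch of both (the file name here is hypothetical; fread's na.strings argument and set() are standard data.table API):

library(data.table)

# Option 1: convert the junk strings to NA at read time
dt <- fread("raw_data.csv", na.strings = c("~", "?", ""))

# Option 2: set() avoids the overhead of [.data.table on every call
vs <- c("~", "?", "")
for (j in names(dt)) {
  ix <- which(dt[[j]] %in% vs)
  if (length(ix)) set(dt, i = ix, j = j, value = NA)
}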
Frank
  • In the actual use case, the data is coming through a JDBC database connection, so `na.strings` isn't an option. However, the substitute/eval combination definitely gets me what I want. Thanks – Stephen Zander Nov 18 '16 at 19:38
  • Ok. Btw, data.table, R and other things are discussed over here, mostly on weekdays: http://chat.stackoverflow.com/rooms/75819/gmts I don't want to spam GitHub emails, but that's an interesting example in your last comment there. Odd that you again managed to have typos since it looks like copy-paste. – Frank Nov 18 '16 at 19:42
  • The lesson is: run your code, don't just write your code – Stephen Zander Nov 18 '16 at 19:47