Assume we have a data.table like:

library(data.table)

set.seed(123666)


dt <- data.table(
  id = seq(1, 5), 
  sample1 = c(sample(c(NA, runif(2))), NA),      
  sample2 = c(NA, sample(c(NA, runif(3)))),    
  sample3 = c(sample(c(NA, runif(4))))      
)
dt
   id   sample1   sample2   sample3
1:  1        NA        NA 0.6387276
2:  2 0.9293370 0.1875354 0.2087892
3:  3 0.1528115        NA 0.7849779
4:  4        NA 0.6875024 0.3684756
5:  5        NA 0.4859773        NA

It has many NA values that we want to fill. Typically, we can use the following syntax to do so:

dt[is.na(dt)] <- 0
dt
   id   sample1   sample2   sample3
1:  1 0.0000000 0.0000000 0.6387276
2:  2 0.9293370 0.1875354 0.2087892
3:  3 0.1528115 0.0000000 0.7849779
4:  4 0.0000000 0.6875024 0.3684756
5:  5 0.0000000 0.4859773 0.0000000
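For this constant-fill case, data.table also provides `setnafill`, which fills by reference instead of copying the table. The following is a minimal sketch rather than part of the original question: `dt0` and `sample_cols` are illustrative names, and the values are rounded copies of the table above.

```r
library(data.table)

# Illustrative table with the same NA pattern as above (values rounded)
dt0 <- data.table(
  id      = 1:5,
  sample1 = c(NA, 0.929, 0.153, NA, NA),
  sample2 = c(NA, 0.188, NA, 0.688, 0.486),
  sample3 = c(0.639, 0.209, 0.785, 0.368, NA)
)

# Fill NA with the constant 0 in the sample columns only, modifying dt0 in place
sample_cols <- c("sample1", "sample2", "sample3")
setnafill(dt0, type = "const", fill = 0, cols = sample_cols)
dt0
```

This only covers a fixed replacement value (or the "locf"/"nocb" types); it does not by itself handle a replacement that depends on the cell's position.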

However, suppose we want to fill NA using a more complex rule, for example a custom function calc_data() that calculates the replacement value. This function needs two inputs (this is just an example): the first is the column name or column index of the NA cell, and the second is the id value of the row containing it.

# example, not the real function
sample_value <- c(1, 3, 3)
names(sample_value) <- c('sample1', 'sample2', 'sample3')
calc_data <- function(sample, id){
    # sample: column name or index of the NA cell; id: id value(s) of its row
    id * 3 + sample_value[sample]
}

Is it possible to fill the NA values with this custom function using data.table syntax, and how do I pass the required values to calc_data?

zhang
  • see ?nafill and ?setnafill – MichaelChirico Aug 29 '23 at 07:30
  • @MichaelChirico, I have checked `nafill`. I understand that it can fill NA using several rules, but I don't know how to use my own function with `nafill`. The tricky part is that the external function must be called with the position of the empty cell, that is, its column name and id, to complete the filling. How can I avoid going through the cells one by one and use a more efficient syntax? – zhang Aug 29 '23 at 07:43
  • `set(x, is.na(x$col), "col", )` is the fully general approach – MichaelChirico Aug 29 '23 at 07:49
  • Yes, that works in general if you know the replacement value for a given place, but in our case the replacement for each NA depends on its position in the table – zhang Aug 29 '23 at 07:55
  • Could you provide a concrete example of which NA values you want to replace? – s_baldur Aug 29 '23 at 09:52
  • Hi, sorry for the confusion. I need to replace all `NA` values. However, the replacement value is not fixed: as mentioned in the example, it depends on the sample name (column name or column index) and the corresponding value in the id column – zhang Aug 29 '23 at 10:02
  • To explain the practical application: in the real case, dt is obtained by combining the output of multiple samples (from a software A) by the id column, where NA marks values that were discarded because they did not pass the filtering of software A. But my downstream analysis requires a valid value instead of NA. I re-implemented part of the algorithm from A in R, and I can recalculate this information by calling this new implementation with the sample name and id – zhang Aug 29 '23 at 10:02

2 Answers

Here is an alternative that isn't as "efficient" as set (the fastest, data.table-canonical way to modify data in place):

dt
#       id sample1 sample2 sample3
#    <int>   <num>   <num>   <num>
# 1:     1      NA      NA   0.639
# 2:     2   0.929   0.188   0.209
# 3:     3   0.153      NA   0.785
# 4:     4      NA   0.688   0.368
# 5:     5      NA   0.486      NA

sample_value <- c(1, 3, 3)
names(sample_value) <- c('sample1', 'sample2', 'sample3')
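# id is passed in explicitly: fun is defined outside dt[...], so it cannot see the columns itself
fun <- function(x, samp, id) fcoalesce(x, calc_data(samp, id))
dt[, names(sample_value) := Map(fun, .SD, sample_value, MoreArgs = list(id = id)),
   .SDcols = names(sample_value)]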
#       id sample1 sample2 sample3
#    <int>   <num>   <num>   <num>
# 1:     1   4.000   6.000   0.639
# 2:     2   0.929   0.188   0.209
# 3:     3   0.153  12.000   0.785
# 4:     4  13.000   0.688   0.368
# 5:     5  16.000   0.486  18.000
  • Map is similar to lapply (call a FUNction once for each element in the list X), but while lapply can operate only on one list, Map "zips" multiple lists/vectors together. In this case, if we were to unroll the Map into a sequence of expressions, we'd effectively get:

    fun(sample1, 1)
    fun(sample2, 3)
    fun(sample3, 3)
    

    I used fun as a predefined function for clarity, though it could just as easily be written as an anonymous function in the call to Map, as in

    Map(function(x, samp) fcoalesce(x, calc_data(samp, id)), .SD, sample_value)
    
  • fcoalesce returns (element-wise along vectors) the first non-NA value found. It takes one or more vectors (I'm using two above, three here for demo). For example,

    fcoalesce(c(1,2,NA), c(3,NA,4), c(NA,5,6))
    # [1] 1 2 4
    

    The 1 and 2 from the first argument are retained. The third position of the first argument is NA, so fcoalesce moves on to the third position of the second argument, finds 4, and returns that. In this case, nothing from the third argument c(NA,5,6) is needed.

    Using fcoalesce here is effectively the same as

    fifelse(is.na(x), calc_data(samp, id), x)
    

    which can also be used in the function if you prefer.

  • I'm (ab)using the names of sample_value both to select the columns to iterate over (in .SDcols) and to assign the results (as names(sample_value) :=).
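
Because calc_data looks up sample_value by its first argument, passing the column name makes that lookup explicit instead of relying on each value also being a valid index. The following is a minimal sketch under that assumption: `fill_col`, `dt1`, and `cols` are illustrative names, calc_data follows the question's example (with `unname` added to drop stray names), and the table is a rounded copy of the question's data.

```r
library(data.table)

sample_value <- c(sample1 = 1, sample2 = 3, sample3 = 3)

# calc_data as in the question's example: depends on the column and the id
calc_data <- function(sample, id) {
  unname(id * 3 + sample_value[sample])
}

dt1 <- data.table(
  id      = 1:5,
  sample1 = c(NA, 0.929, 0.153, NA, NA),
  sample2 = c(NA, 0.188, NA, 0.688, 0.486),
  sample3 = c(0.639, 0.209, 0.785, 0.368, NA)
)

cols <- names(sample_value)

# Hypothetical helper: keep existing values, compute a replacement where x is NA.
# It receives the column *name* and the id column as explicit arguments.
fill_col <- function(x, col_name, id) fcoalesce(x, calc_data(col_name, id))

dt1[, (cols) := Map(fill_col, .SD, cols, MoreArgs = list(id = id)), .SDcols = cols]
dt1
```

Note that calc_data is evaluated for every row here and fcoalesce then discards the results for non-NA cells, which is simple but does more work than filling only the NA rows.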

r2evans

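Perhaps something like this could work:

for (j in 2:4) {
  na_rows <- which(is.na(dt[[j]]))   # rows where column j is NA
  set(dt,
      i = na_rows,
      j = j,
      # pass the ids as a plain vector; j - 1 is the column's position in sample_value
      value = calc_data(j - 1, dt$id[na_rows]))
}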

Output (this table was generated from different random values than the one in the question):

   id    sample1    sample2    sample3
1:  1  0.8055845 6.00000000  0.2030456
2:  2  0.5705721 9.00000000  0.7954992
3:  3 10.0000000 0.09605308 12.0000000
4:  4 13.0000000 0.25545666  0.6506906
5:  5  0.8055845 0.51889032  0.8931946
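
The same loop can be written over column names rather than hard-coded positions, so it does not depend on where the sample columns sit in the table. This is a sketch, not the answerer's code: `fill_na_by_position` and `dt2` are hypothetical names, calc_data follows the question's example, and the table is a rounded copy of the question's data.

```r
library(data.table)

sample_value <- c(sample1 = 1, sample2 = 3, sample3 = 3)

# calc_data as in the question's example: depends on the column and the id
calc_data <- function(sample, id) {
  unname(id * 3 + sample_value[sample])
}

# Hypothetical helper: fill every NA in the given columns by reference,
# computing each replacement from the column name and the row's id
fill_na_by_position <- function(dt, cols, id_col = "id") {
  for (col in cols) {
    na_rows <- which(is.na(dt[[col]]))
    if (length(na_rows) == 0L) next   # nothing to fill in this column
    set(dt, i = na_rows, j = col,
        value = calc_data(col, dt[[id_col]][na_rows]))
  }
  invisible(dt)
}

dt2 <- data.table(
  id      = 1:5,
  sample1 = c(NA, 0.929, 0.153, NA, NA),
  sample2 = c(NA, 0.188, NA, 0.688, 0.486),
  sample3 = c(0.639, 0.209, 0.785, 0.368, NA)
)

fill_na_by_position(dt2, setdiff(names(dt2), "id"))
dt2
```

Unlike the fcoalesce approach, calc_data is called only for the rows that are actually NA, which matters if the real function is expensive.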
Wimpel