data.table fill NA by custom function and other cells

Question

Assume we have a data.table like:

library(data.table)

set.seed(123666)


dt <- data.table(
  id = seq(1, 5), 
  sample1 = c(sample(c(NA, runif(2))), NA),      
  sample2 = c(NA, sample(c(NA, runif(3)))),    
  sample3 = c(sample(c(NA, runif(4))))      
)

dt
   id   sample1   sample2   sample3
1:  1        NA        NA 0.6387276
2:  2 0.9293370 0.1875354 0.2087892
3:  3 0.1528115        NA 0.7849779
4:  4        NA 0.6875024 0.3684756
5:  5        NA 0.4859773        NA

It has many NA values that we want to fill. Typically, we can use the following syntax:

dt[is.na(dt)] <- 0

dt
   id   sample1   sample2   sample3
1:  1 0.0000000 0.0000000 0.6387276
2:  2 0.9293370 0.1875354 0.2087892
3:  3 0.1528115 0.0000000 0.7849779
4:  4 0.0000000 0.6875024 0.3684756
5:  5 0.0000000 0.4859773 0.0000000
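For the constant-fill case, data.table also has an in-place helper, `setnafill()`. The sketch below assumes a data.table version that ships `nafill()`/`setnafill()` (1.12.4 or later) and uses a stand-in table with the question's values rounded to three digits:

```r
library(data.table)

# Stand-in for the question's dt (values rounded to three digits).
dt <- data.table(
  id      = 1:5,
  sample1 = c(NA, 0.929, 0.153, NA, NA),
  sample2 = c(NA, 0.188, NA, 0.688, 0.486),
  sample3 = c(0.639, 0.209, 0.785, 0.368, NA)
)

# Replace NA with the constant 0 in the sample columns, modifying dt by reference.
sample_cols <- c("sample1", "sample2", "sample3")
setnafill(dt, type = "const", fill = 0, cols = sample_cols)
```

This only covers a fixed replacement value; it cannot call a function that depends on the row's id or on the column.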

However, suppose we want to fill the NAs with a more complex rule, for example a custom function calc_data(). As an illustration, this function needs two inputs: the first is the column name (or column index) of the NA cell, and the second is the id value of the row that contains it.

# example only, not the real function
sample_value <- c(1, 3, 3)
names(sample_value) <- c('sample1', 'sample2', 'sample3')
calc_data <- function(sample, id){
    # sample: column name or column index; id: id value(s) of the rows to fill
    id * 3 + sample_value[sample]
}
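To make the intended rule concrete, here is a small check of what calc_data() should return for the NA cells of sample1 (ids 1, 4 and 5 in the table above). It is only a sketch: it restates the example function and uses `[[` instead of `[` so the result carries no element name.

```r
# Same example rule as above; [[ ]] drops the element name from the lookup.
sample_value <- c(sample1 = 1, sample2 = 3, sample3 = 3)
calc_data <- function(sample, id) {
  id * 3 + sample_value[[sample]]
}

# sample1 is NA for ids 1, 4 and 5.
by_name  <- calc_data("sample1", c(1, 4, 5))  # look the column up by name
by_index <- calc_data(1, c(1, 4, 5))          # or by position in sample_value
print(by_name)  # 4 13 16
```

The function is vectorised over id, so one call per column is enough; there is no need to visit the cells one by one.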

Is it possible to fill the NAs with this custom function using data.table syntax, and how do I pass the required values to calc_data()?

@MichaelChirico, I have checked `nafill`, I got that it can fill NA with more rule, but I am not known how to use my own function with `nafill`. The tricky problem is that you need to call some external function through the position of the blank cell, that is, its column name and id, to complete the filling. How to avoid going through the cell one by one and use a more efficient syntax? — zhang, Aug 29 '23 at 07:43
`set(x, is.na(x$col), "col", )` is the fully general approach — MichaelChirico, Aug 29 '23 at 07:49
Yeah, this is a general task if you know the missing value in a certain place, but the problem is that in our problem, the value of NA must depend on its position in the table — zhang, Aug 29 '23 at 07:55
Could you provide a concrete example of which NA values you want to replace? — s_baldur, Aug 29 '23 at 09:52
Hi, sorry for confusing, I need to replace all `NA` value, However, the value used to replace NA is not fixed, as mentioned in the example, the replacement value is related to the sample name (column name or column index) and the corresponding value in the id column — zhang, Aug 29 '23 at 10:02
I would like to explain its practical application, in the real case, dt is obtained by combining the output of multiple samples (from a software A) by the id column, where NA is discarded because it does not pass through the filtering of software A. But my downstream analysis requires a valid value instead of NA, I re-implement part of the algorithm in A in R, and I can recalculate this information by calling this new implementation and giving the sample name and id — zhang, Aug 29 '23 at 10:02

score 1 · Answer 1 · answered Aug 29 '23 at 12:08

An alternative, though not as "efficient" as set (which is the fastest data.table-canonical way to modify its data), is:

dt
#       id sample1 sample2 sample3
#    <int>   <num>   <num>   <num>
# 1:     1      NA      NA   0.639
# 2:     2   0.929   0.188   0.209
# 3:     3   0.153      NA   0.785
# 4:     4      NA   0.688   0.368
# 5:     5      NA   0.486      NA

sample_value <- c(1, 3, 3)
names(sample_value) <- c('sample1', 'sample2', 'sample3')
fun <- function(x, samp, id) fcoalesce(x, calc_data(samp, id))
dt[, names(sample_value) := Map(fun, .SD, names(sample_value), MoreArgs = list(id = id)),
   .SDcols = names(sample_value)]
dt
#       id sample1 sample2 sample3
#    <int>   <num>   <num>   <num>
# 1:     1   4.000   6.000   0.639
# 2:     2   0.929   0.188   0.209
# 3:     3   0.153  12.000   0.785
# 4:     4  13.000   0.688   0.368
# 5:     5  16.000   0.486  18.000

Map is similar to lapply (it calls a function once for each element of a list), but while lapply iterates over a single list, Map "zips" multiple lists/vectors together. Here it pairs each column of .SD with its column name. Anything in MoreArgs is passed unchanged to every call; that is how fun receives the whole id column, since a function defined outside dt[...] cannot see the column by itself. If we were to unroll the Map into a sequence of expressions, we'd effectively get:
```
fun(sample1, "sample1", id = id)
fun(sample2, "sample2", id = id)
fun(sample3, "sample3", id = id)
```
I defined fun separately for clarity, though it could just as easily be an anonymous function in the call to Map. Because such a function is created inside dt[...], it can refer to id directly:
```
Map(function(x, samp) fcoalesce(x, calc_data(samp, id)), .SD, names(sample_value))
```
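For readers less familiar with Map, a tiny base-R illustration of the "zipping" may help. The names `cols` and `offsets` are made up for this sketch; `MoreArgs` is forwarded to mapply, which Map wraps.

```r
# Map() calls the function once per position, pairing the i-th elements.
cols    <- list(a = c(1, NA), b = c(NA, 2))
offsets <- c(10, 20)
zipped  <- Map(function(x, off) x + off, cols, offsets)
# same as list(a = cols$a + 10, b = cols$b + 20)

# Arguments that every call should receive whole go in MoreArgs.
scaled <- Map(function(x, off, k) (x + off) * k, cols, offsets,
              MoreArgs = list(k = c(1, 2)))
# same as list(a = (cols$a + 10) * c(1, 2), b = (cols$b + 20) * c(1, 2))
```

The result keeps the names of the first zipped argument, which is what lets a Map over .SD be assigned straight back to the same columns.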
fcoalesce returns (element-wise along vectors) the first non-NA value found. It takes one or more vectors (I'm using two above, three here for demo). For example,
```
fcoalesce(c(1,2,NA), c(3,NA,4), c(NA,5,6))
# [1] 1 2 4
```
The 1,2 from the first argument are retained. The third position is NA, so it shifts to the third position in the second argument and finds 4, so it returns that as the third. In this case, nothing in the third argument c(NA,5,6) is needed.

Using fcoalesce here is effectively the same as
```
fifelse(is.na(x), calc_data(samp, id), x)
```
which can also be used in the function if you prefer.
I'm (ab)using the names of sample_value both to iterate (in .SDcols) and to assign (as names(sample_value) :=).
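Putting the pieces together, here is a self-contained sketch of the fifelse variant. Assumptions: the table holds the question's values rounded to three digits, calc_data() is the question's example rule (written with `[[` so the result is unnamed), and `fill_na` is a hypothetical helper name, not a data.table function.

```r
library(data.table)

# Stand-in for the question's dt (values rounded to three digits).
dt <- data.table(
  id      = 1:5,
  sample1 = c(NA, 0.929, 0.153, NA, NA),
  sample2 = c(NA, 0.188, NA, 0.688, 0.486),
  sample3 = c(0.639, 0.209, 0.785, 0.368, NA)
)
sample_value <- c(sample1 = 1, sample2 = 3, sample3 = 3)
calc_data <- function(sample, id) id * 3 + sample_value[[sample]]

# Hypothetical helper: keep x where present, otherwise use the computed value.
fill_na <- function(x, samp, id) fifelse(is.na(x), calc_data(samp, id), x)

cols <- names(sample_value)
# One call per column; id is handed to every call through MoreArgs.
dt[, (cols) := Map(fill_na, .SD, cols, MoreArgs = list(id = id)), .SDcols = cols]
```

Note that calc_data() is evaluated for every row of a column and then discarded where x is not NA. If the real function is expensive, filling only the NA rows with set avoids that extra work.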

score 0 · Accepted Answer · answered Aug 29 '23 at 10:46

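Perhaps something like this could work:

for (j in 2:4) {
  na_rows <- which(is.na(dt[[j]]))   # rows where column j is NA
  # j - 1 maps the column position in dt to the index in sample_value
  set(dt, i = na_rows, j = j,
      value = calc_data(j - 1, dt$id[na_rows]))
}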

output (from a different random draw of dt than the table printed in the question)

   id    sample1    sample2    sample3
1:  1  0.8055845 6.00000000  0.2030456
2:  2  0.5705721 9.00000000  0.7954992
3:  3 10.0000000 0.09605308 12.0000000
4:  4 13.0000000 0.25545666  0.6506906
5:  5  0.8055845 0.51889032  0.8931946
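A variant of the same loop that addresses columns by name rather than by position may be easier to keep correct if the column order changes. This is a sketch under the same assumptions as the question's example: a stand-in table with the values rounded to three digits, and the example calc_data() written with `[[`.

```r
library(data.table)

# Stand-in for the question's dt (values rounded to three digits).
dt <- data.table(
  id      = 1:5,
  sample1 = c(NA, 0.929, 0.153, NA, NA),
  sample2 = c(NA, 0.188, NA, 0.688, 0.486),
  sample3 = c(0.639, 0.209, 0.785, 0.368, NA)
)
sample_value <- c(sample1 = 1, sample2 = 3, sample3 = 3)
calc_data <- function(sample, id) id * 3 + sample_value[[sample]]

for (col in names(sample_value)) {
  na_rows <- which(is.na(dt[[col]]))        # rows to fill in this column
  if (length(na_rows) > 0L) {
    # calc_data() is called only for the ids whose cell is missing
    set(dt, i = na_rows, j = col, value = calc_data(col, dt$id[na_rows]))
  }
}
```

Each column is handled in one vectorised call, so the loop runs once per column, not once per cell.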
