I've looked at the discussion "What is the difference between as.tibble(), as_data_frame(), and tbl_df()?" to figure out why the replace_na() function shown below works on data frames but not on tibbles. Why doesn't it work on tibbles, and how can the function be modified so that it works for both data.frame and tibble inputs?

Data

library(dplyr)

#dput(df1)
df1 <- structure(list(id = c(1, 2, 3, 4), gender = c("M", "F", NA, "F"
), grade = c("A", NA, NA, NA), age = c(2, NA, 2, NA)), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))

#dput(df2)
df2 <- structure(list(id = c(1, 2, 3, 4), gender = c("M", "F", "M", 
"F"), grade = c("A", "A", "B", "NG"), age = c(22, 23, 21, 19)), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))

replace function

replace_na <- function(df_to, df_from) {
  replace(df_to, is.na(df_to), df_from[is.na(df_to)])
}

Usage

replace_na(df1,df2)

Error: Must use a vector in [, not an object of class matrix.

Call rlang::last_error() to see a backtrace

Called from: abort(error_dim_column_index(j))

However, coercing both arguments to data frames produces the desired output, as shown below.

replace_na(as.data.frame(df1), as.data.frame(df2))
#   id gender grade age
# 1  1      M     A   2
# 2  2      F     A  23
# 3  3      M     B   2
# 4  4      F    NG  19

Thank you.

deepseefan

1 Answer

is.na() returns a logical matrix for a data frame:

is.na(df1) 
#>         id gender grade   age
#> [1,] FALSE  FALSE FALSE FALSE
#> [2,] FALSE  FALSE  TRUE  TRUE
#> [3,] FALSE   TRUE  TRUE FALSE
#> [4,] FALSE  FALSE  TRUE  TRUE

The base data.frame class supports subsetting with a logical matrix; tbl_df is stricter and does not:

as.data.frame(df2)[is.na(df1)]
#> [1] "M"  "A"  "B"  "NG" "23" "19"
df2[is.na(df1)]
#> Must use a vector in `[`, not an object of class matrix.
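
To see what the original function relies on, it can help to look at matrix indexing on a plain data frame in isolation. The following is a minimal base R sketch; `d_to` and `d_from` are hypothetical toy objects, not the data from the question:

```r
# Hypothetical toy data, base R only
d_to   <- data.frame(a = c(1, NA), b = c(NA, "y"), stringsAsFactors = FALSE)
d_from <- data.frame(a = c(10, 20), b = c("p", "q"), stringsAsFactors = FALSE)

# is.na() on a data frame returns a logical matrix with the same dimensions
m <- is.na(d_to)
is.matrix(m)
dim(m)

# Indexing a base data.frame with that matrix converts the data frame to a
# matrix first, so the selected values come back as a single vector in
# column-major order. With mixed column types they are coerced to character.
picked <- d_from[m]
picked
```

Because the values are pulled through a matrix, column types can be lost along the way, which is one reason a column-wise approach may be preferable even for plain data frames.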

To make your replace_na() function work with a tbl_df you need to do the operation separately for each column. For example, with recursion:

replace_na <- function(x, y) {
  if (is.data.frame(x)) {
    x[] <- Map(replace_na, x, y)
    return(x)
  }

  replace(x, is.na(x), y[is.na(x)])
}

replace_na(df1, df2)
#> # A tibble: 4 x 4
#>      id gender grade   age
#>   <dbl> <chr>  <chr> <dbl>
#> 1     1 M      A         2
#> 2     2 F      A        23
#> 3     3 M      B         2
#> 4     4 F      NG       19
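
Note that `Map()` pairs the columns of `x` and `y` by position, so the function above assumes both inputs have the same columns in the same order. If that cannot be guaranteed, a variant that matches columns by name might look like the sketch below. The helper name `fill_na_from` and the toy data are hypothetical, and the sketch assumes both inputs have the same number of rows in the same row order:

```r
# Hypothetical helper: fill NAs in `x` from the same-named columns of `y`.
# Assumes nrow(x) == nrow(y) and that rows are already aligned.
fill_na_from <- function(x, y) {
  stopifnot(is.data.frame(x), is.data.frame(y), nrow(x) == nrow(y))
  shared <- intersect(names(x), names(y))
  for (col in shared) {
    na_idx <- is.na(x[[col]])
    # Replace only the missing entries, one column at a time
    x[[col]][na_idx] <- y[[col]][na_idx]
  }
  x
}

# Toy data with the columns of `from` in a different order than `to`
to   <- data.frame(id = 1:3, grade = c("A", NA, NA), stringsAsFactors = FALSE)
from <- data.frame(grade = c("A", "B", "C"), id = 1:3, stringsAsFactors = FALSE)

filled <- fill_na_from(to, from)
filled
```

Since it only uses `[[` on single columns and never indexes with a matrix, the same approach should also apply to tibbles, although the sketch is only exercised here on base data frames.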

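In the benchmark below, this column-wise method is also faster than matrix subsetting on coerced data frames, both on the original 4-row data and on 40,000-row copies: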

replace_na_vec <- function(x, y) {
  replace(x, is.na(x), y[is.na(x)])
}

df1_10k <- do.call("rbind", replicate(10000, df1, simplify = FALSE))
df2_10k <- do.call("rbind", replicate(10000, df2, simplify = FALSE))

bench::mark(
  check = FALSE,
  new = replace_na(df1, df2),
  old = replace_na_vec(as.data.frame(df1), as.data.frame(df2)),
  new_10k = replace_na(df1_10k, df2_10k),
  old_10k = replace_na_vec(as.data.frame(df1_10k), as.data.frame(df2_10k))
)
#> # A tibble: 4 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new         74.01us  97.79us   7295.          0B    12.6 
#> 2 old        269.97us 529.93us   1845.     81.02KB     8.23
#> 3 new_10k      1.82ms   2.75ms    338.      4.27MB    32.3 
#> 4 old_10k     94.29ms 104.05ms      9.68   10.24MB     2.42

Created on 2019-09-12 by the reprex package (v0.3.0)

Mikko Marttila
  • Thank you @Mikko. Though not part of the question, I wonder how handling each column separately compares in performance with coercing the arguments using `as.data.frame`? – deepseefan Sep 12 '19 at 10:55
  • @deepseefan I added a quick benchmark: it would seem the column-wise approach is generally faster than using matrix subsetting with `as.data.frame()`. – Mikko Marttila Sep 12 '19 at 11:58