
I've been trying for a while to replace NA entries in a data frame with values of my choice, without success. I checked several sources and tried the code below. Can anyone point out why my commands don't work, in spite of some sources suggesting they should?

The data frame exampleDF below contains some NA entries in the column "zacko":

> exampleDF
                 dates random letters action    zacko
1  2018-10-30 00:05:19     10       a     go   Mickey
2  2018-10-30 13:58:39      2       b    run   Donald
3  2018-10-31 03:51:59      1       c    fly     <NA>
4  2018-10-31 17:45:19     10       d    sit    Goofy
5  2018-11-01 07:38:39     10       e   jump    Daisy
6  2018-11-01 21:31:59     13       f   hike     <NA>
7  2018-11-02 11:25:19      6       g  dance     <NA>
8  2018-11-03 01:18:39      6       h     go Dagobert
9  2018-11-03 15:11:59      8       i  dance     <NA>
10 2018-11-04 05:05:19      6       j    run    Pluto
11 2018-11-04 18:58:39      2       k    sit     <NA>
12 2018-11-05 08:51:59      6       l  laugh   Minnie
13 2018-11-05 22:45:19      3       m    cry   Gustav
14 2018-11-06 12:38:39     11       n  write Reginald
15 2018-11-07 02:31:59      1       o    fly     <NA>

I looked at "Correct syntax for mutate_if" and tried to replace these entries with values of my choice accordingly:

> exampleDF %>% mutate_if(is.character, funs(ifelse(is.na(.), "REPLACEMENT", .)))
Warning message:
funs() is soft deprecated as of dplyr 0.8.0
please use list() instead

# Before:
funs(name = f(.)

# After:
list(name = ~f(.))

> exampleDF %>% mutate_if(is.character, list(ifelse(is.na(.), 
"REPLACEMENT",.)))
Error: Can't create call to non-callable object
Call `rlang::last_error()` to see a backtrace

without success (as you can see from the error messages). Interestingly, the commands below work like a charm at the console:

> df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"), z = list(1:5, NULL, 
10:20))
> df
# A tibble: 3 x 3
      x y     z         
  <dbl> <chr> <list>    
1     1 a     <int [5]> 
2     2 NA    <NULL>    
3    NA b     <int [11]>
> df %>% replace_na(list(x = 0, y = "unknown"))
# A tibble: 3 x 3
      x y       z         
  <dbl> <chr>   <list>    
1     1 a       <int [5]> 
2     2 unknown <NULL>    
3     0 b       <int [11]>

> df %>% mutate(x = replace_na(x, 0))
# A tibble: 3 x 3
      x y     z         
  <dbl> <chr> <list>    
1     1 a     <int [5]> 
2     2 NA    <NULL>    
3     0 b     <int [11]>

Why don't the equivalent commands work for my data frame? See the warning messages below:

exampleDF %>% replace_na(list(dates = as.POSIXct("2018-10-30 13:58:39"), 
random = 5, letters = "a", action = "crying", zacko = "FRUSTRATION"))
                 dates random letters action    zacko
1  2018-10-30 00:05:19     10       a     go   Mickey
2  2018-10-30 13:58:39      2       b    run   Donald
3  2018-10-31 03:51:59      1       c    fly     <NA>
4  2018-10-31 17:45:19     10       d    sit    Goofy
5  2018-11-01 07:38:39     10       e   jump    Daisy
6  2018-11-01 21:31:59     13       f   hike     <NA>
7  2018-11-02 11:25:19      6       g  dance     <NA>
8  2018-11-03 01:18:39      6       h     go Dagobert
9  2018-11-03 15:11:59      8       i  dance     <NA>
10 2018-11-04 05:05:19      6       j    run    Pluto
11 2018-11-04 18:58:39      2       k    sit     <NA>
12 2018-11-05 08:51:59      6       l  laugh   Minnie
13 2018-11-05 22:45:19      3       m    cry   Gustav
14 2018-11-06 12:38:39     11       n  write Reginald
15 2018-11-07 02:31:59      1       o    fly     <NA>
Warning messages:
1: In `[<-.factor`(`*tmp*`, !is_complete(data[[var]]), value = "crying") :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, !is_complete(data[[var]]), value = 
"FRUSTRATION") :
  invalid factor level, NA generated


> exampleDF %>% mutate(zacko = replace_na(zacko, "GAGA"))
                 dates random letters action    zacko
1  2018-10-30 00:05:19     10       a     go   Mickey
2  2018-10-30 13:58:39      2       b    run   Donald
3  2018-10-31 03:51:59      1       c    fly     <NA>
4  2018-10-31 17:45:19     10       d    sit    Goofy
5  2018-11-01 07:38:39     10       e   jump    Daisy
6  2018-11-01 21:31:59     13       f   hike     <NA>
7  2018-11-02 11:25:19      6       g  dance     <NA>
8  2018-11-03 01:18:39      6       h     go Dagobert
9  2018-11-03 15:11:59      8       i  dance     <NA>
10 2018-11-04 05:05:19      6       j    run    Pluto
11 2018-11-04 18:58:39      2       k    sit     <NA>
12 2018-11-05 08:51:59      6       l  laugh   Minnie
13 2018-11-05 22:45:19      3       m    cry   Gustav
14 2018-11-06 12:38:39     11       n  write Reginald
15 2018-11-07 02:31:59      1       o    fly     <NA>
Warning message:
In `[<-.factor`(`*tmp*`, !is_complete(data), value = "GAGA") :
  invalid factor level, NA generated

I would have expected my code above to work, given the examples in "Correct syntax for mutate_if" and in the help file for replace_na(data, replace, ...) (from the tidyr package).

Yozef

2 Answers

In fact, your problems are not caused by the replacement failing, but by the fact that zacko is a factor.

Regarding your first attempt: despite the deprecation warning, the funs() call is valid and replaces the NAs in character columns with "REPLACEMENT" (but see the explanation about factors below!). The new syntax is a little different: to use list() instead of funs(), you have to add a tilde, like this:

exampleDF %>% mutate_if(is.character, list(~ ifelse(is.na(.), "REPLACEMENT", .)))
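Keep in mind that mutate_if() only touches columns for which the predicate returns TRUE, so it is worth checking what each column actually is before relying on is.character. A minimal base R sketch, assuming a small stand-in data frame (toyDF is a made-up name, not the original data):

```r
# toyDF is a hypothetical stand-in for exampleDF; stringsAsFactors = TRUE
# is set explicitly to mimic a data frame whose text columns became factors.
toyDF <- data.frame(
  random = c(10L, 2L, 1L),
  zacko  = c("Mickey", NA, "Goofy"),
  stringsAsFactors = TRUE
)

# Class of every column: zacko is reported as a factor, not as character.
col_classes <- vapply(toyDF, function(col) class(col)[1], character(1))
print(col_classes)

# The predicate used in mutate_if(is.character, ...) is FALSE for a factor,
# so such a column would be left untouched.
is_chr <- vapply(toyDF, is.character, logical(1))
is_fct <- vapply(toyDF, is.factor, logical(1))
print(is_chr)
print(is_fct)
```

If the column you care about shows up as a factor here, a predicate such as is.factor (or a conversion to character first) is probably what you want.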

The other one also works... or rather, would work if zacko were a character vector. Apparently (I can't be sure, because you did not use dput to share your example data) exampleDF$zacko is a factor. If you try to assign a value to a factor and that value is not one of its levels, you get this warning:

> x <- factor(c("a", "b", "c"))
> x[1] <- "REPLACEMENT"
Warning message:
In `[<-.factor`(`*tmp*`, 1, value = "REPLACEMENT") :
  invalid factor level, NA generated
> x
[1] <NA> b    c   
Levels: a b c

So the replacement did happen, but since zacko is a factor and the new value was not one of its levels, the entry was set to NA again. Try this:

exampleDF$zacko <- as.character(exampleDF$zacko)

Your code should now work fine. Alternatively, if you want to keep it as a factor, add "FRUSTRATION" to the levels of zacko:

levels(exampleDF$zacko) <- c(levels(exampleDF$zacko), "FRUSTRATION")
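Put together, the factor-preserving route might look like the sketch below. It uses a made-up factor rather than the original data, and the replacement label is just an example:

```r
# Hypothetical example factor with missing entries.
zacko <- factor(c("Mickey", NA, "Goofy", NA))

# Step 1: register the replacement as a valid level (only if it is new).
replacement <- "FRUSTRATION"
if (!replacement %in% levels(zacko)) {
  levels(zacko) <- c(levels(zacko), replacement)
}

# Step 2: the assignment no longer triggers "invalid factor level".
zacko[is.na(zacko)] <- replacement
print(zacko)
```

The order matters: the level has to exist before the assignment, otherwise the new entries become NA again.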

Note also that before R 4.0.0, data.frame turned character vectors into factors by default:

> foo <- data.frame(zacko=letters[1:5])
> foo$zacko
[1] a b c d e
Levels: a b c d e

This is a very annoying and dangerous behavior. You don't want that! That is why many users of R set the following in their profiles:

options(stringsAsFactors=FALSE)

A tibble or data table does not behave like that:

> foo <- tibble(zacko=letters[1:5])
> foo$zacko
[1] "a" "b" "c" "d" "e"
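If you would rather not depend on a global option or on session defaults, the conversion can also be controlled per call. A small sketch using the stringsAsFactors argument of data.frame() (foo_chr and foo_fct are illustrative names):

```r
# Ask explicitly for character columns, regardless of session defaults.
foo_chr <- data.frame(zacko = letters[1:5], stringsAsFactors = FALSE)
print(class(foo_chr$zacko))

# Ask explicitly for factors, to reproduce the behaviour described above.
foo_fct <- data.frame(zacko = letters[1:5], stringsAsFactors = TRUE)
print(class(foo_fct$zacko))
```

Spelling the argument out makes a script behave the same way on any machine, whatever is set in the user's profile.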

Finally, in this simple case I would probably just use good old base R:

exampleDF$zacko[ is.na(exampleDF$zacko) ] <- "REPLACEMENT"
January

I try to avoid factors and use if_na() to do this. First I convert zacko from factor to character.

Code

library(dplyr)   # provides %>% and mutate_if()
library(hablar)  # provides convert(), chr() and if_na()

df %>% 
  convert(chr(zacko)) %>% 
  mutate_if(is.character, ~if_na(., "REPLACEMENT"))

Result

   random zacko      
    <int> <chr>      
 1     10 Mickey     
 2      2 Donald     
 3      1 REPLACEMENT
 4     10 Goofy      
 5     10 Daisy      
 6     13 REPLACEMENT
 7      6 REPLACEMENT
 8      6 Dagobert   
 9      8 REPLACEMENT
10      6 Pluto      
11      2 REPLACEMENT
12      6 Minnie     
13      3 Gustav     
14     11 Reginald   
15      1 REPLACEMENT
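
For readers who do not have hablar installed, the same two steps (convert factors to character, then fill the gaps) can be sketched in base R. This is an alternative spelled out for illustration, not what the code above does internally; toy and fill_na are hypothetical names:

```r
# Hypothetical stand-in data with a factor column containing NA.
toy <- data.frame(
  random = c(10L, 2L, 1L),
  zacko  = factor(c("Mickey", "Donald", NA))
)

# Replace NA in a vector with a given value.
fill_na <- function(x, value) {
  x[is.na(x)] <- value
  x
}

# Step 1: turn every factor column into a character column.
fct_cols <- vapply(toy, is.factor, logical(1))
toy[fct_cols] <- lapply(toy[fct_cols], as.character)

# Step 2: fill NA in every character column.
chr_cols <- vapply(toy, is.character, logical(1))
toy[chr_cols] <- lapply(toy[chr_cols], fill_na, value = "REPLACEMENT")
print(toy)
```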

Data

df <- structure(list(
  random = c(10L, 2L, 1L, 10L, 10L, 13L, 6L, 6L, 8L, 6L, 2L, 6L, 3L, 11L, 1L),
  zacko = structure(
    c(6L, 3L, NA, 4L, 2L, NA, NA, 1L, NA, 8L, NA, 7L, 5L, 9L, NA),
    .Label = c("Dagobert", "Daisy", "Donald", "Goofy", "Gustav", "Mickey",
               "Minnie", "Pluto", "Reginald"),
    class = "factor")),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -15L))
davsjob