I am trying to replace a set of typos in a data frame with NA.

This is what I've got so far:

master_df <- invisible(
  data.frame(lapply(master_df, 
                    function(x) replace(x, as.matrix(x) == c("?", '-',''), NA))))

However, the output looks as follows:

#  a    b    c
#1        <NA>
#2 ? <NA> <NA>
#3 1    2    1
#4 2    3    2
#5 3    4    3

And it throws the following warnings:

Warning messages:
1: In as.matrix(x) == c("?", "-", "") :
  longer object length is not a multiple of shorter object length
2: In as.matrix(x) == c("?", "-", "") :
  longer object length is not a multiple of shorter object length
3: In as.matrix(x) == c("?", "-", "") :
  longer object length is not a multiple of shorter object length

The idea is that every value in the set of typos c('?', '-', '') is replaced by NA across the whole data frame.

How could I accomplish this task?

data

master_df <- structure(list(a = c("", "?", "1", "2", "3"), b = c("", NA, "2", 
"3", "4"), c = c(NA, NA, "1", "2", "3")), class = "data.frame", row.names = c(NA, 
-5L))
AlSub

2 Answers

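We need %in% instead of ==. The == operator compares element by element and recycles the shorter vector, so each value is checked against only one of the typos (hence the length warning), whereas %in% tests each value for membership in the whole set. With dplyr: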

library(dplyr)
master_df2 <- master_df %>%
    mutate(across(everything(), 
  ~ replace(., . %in% c("?", '-', ''), NA_character_))) %>% 
    type.convert(as.is = TRUE)
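
To see the difference between the two operators in isolation, here is a minimal sketch on column a of the example data. The names x, typos, elementwise and membership are illustrative only and are not part of the original code.

```r
# Column `a` from the question's data and the set of typos to remove
x <- c("", "?", "1", "2", "3")
typos <- c("?", "-", "")

# `==` recycles `typos` to c("?", "-", "", "?", "-") and compares
# position by position, so no element happens to match here
elementwise <- suppressWarnings(x == typos)
elementwise
# [1] FALSE FALSE FALSE FALSE FALSE

# `%in%` checks each element of `x` against the whole set
membership <- x %in% typos
membership
# [1]  TRUE  TRUE FALSE FALSE FALSE

# Only the membership test flags the values that should become NA
replace(x, membership, NA_character_)
# [1] NA  NA  "1" "2" "3"
```

This is consistent with the question's output, where column a was left unchanged.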

Or using base R

master_df[] <- lapply(master_df, function(x)
      replace(x, x %in% c("?", '-', ''), NA_character_))

Or using gsub

master_df[] <- gsub('^(\\?|-|)$', NA, as.matrix(master_df))
master_df <- type.convert(master_df, as.is = TRUE)

A better option is to specify na.strings = c("?", "-", "") when reading the data with read.csv/read.table, so that these values become NA at import time.
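
As a rough illustration of the na.strings approach, the sketch below reads a small inline CSV instead of a file. The csv_text contents are made up for the example and only loosely mirror the question's data.

```r
# Hypothetical CSV contents, passed through `text =` instead of a file path
csv_text <- "a,b,c\n,,\n?,,\n1,2,1\n2,3,2\n3,4,3"

# Every field equal to "?", "-" or "" is read as NA
master_df <- read.csv(text = csv_text, na.strings = c("?", "-", ""))

# With the typos gone, the columns are parsed as integers right away
str(master_df)
```

Note that supplying na.strings replaces the default value "NA", so add "NA" to the vector if the file also uses that literal string for missing values.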

akrun

Perhaps you can try the code below

master_df[] <- replace(as.matrix(master_df), as.matrix(master_df) %in% c("?", "-", ""), NA)

which gives

> master_df
     a    b    c
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
3    1    2    1
4    2    3    2
5    3    4    3
ThomasIsCoding