1

I have a fairly large data frame that has multiple "-" which represent missing data. The data frame consisted of multiple Excel files, which could not use the "na.strings =" or alternative function, so I had to import them with the "-" representation.

How can I replace all "-" in the data frame with NA / missing values? The data frame consists of 200 columns of characters, factors, and integers.

So far I have tried:

sum(df %in c("-"))
returns: [1] 0

df[df=="-"] <-NA #does not do anything

library(plyr)
df <- revalue(df, c("-",NA))
returns: Error in revalue(tmp, c("-", NA)) : 
  x is not a factor or a character vector.

library(anchors)
df <- replace.value(df,colnames(df),"-",as.character(NA))
Error in charToDate(x) : 
  character string is not in a standard unambiguous format

The data frame consists of 200 columns of characters, factors, and integers, so I can see why the last two do not work correctly. Any help would be appreciated.

Starbucks
  • 1,448
  • 3
  • 21
  • 49
  • 2
    Can you post a subset of your data as a sample so that this is reproducible on your data? Thanks :) – mysteRious Mar 23 '18 at 20:36
  • How many rows does your data have? – bala83 Mar 23 '18 at 20:41
  • ≈500,000 rows.. – Starbucks Mar 23 '18 at 20:42
  • I think your best option would be to clean up the data row-wise and writing it out to a file. Then you can read your cleaned up file :) – bala83 Mar 23 '18 at 20:43
  • I've tried that. When I re-import the data (the cleaned version with NA's), it turns all cleaned columns in to logical classes... :-/ – Starbucks Mar 23 '18 at 20:44
  • If an entry has a "-" in it then it would have been read as a factor variable by default. You will (probably) need to learn how to convert a factor to a character and then use `is.na<-`. You have failed to include the code used to create your dataframe and there are many different possibilities to what your object actually looks like. If you used colClasses to force it being read as "numeric" then the coercion should have occurred properly. – IRTFM Mar 23 '18 at 20:45
  • Oh there is an option when you're reading data using `read.table()` or `scan` to prevent co-ercing by explicitly specifying colClasses and na.strings. Did you try with these options set ? – bala83 Mar 23 '18 at 20:48

3 Answers3

10

Since you're already using tidyverse functions, you can easily use na_if from dplyr within your pipes.

For example, I have a dataset where 999 is used to fill in a non-answer:

df <- tibble(
    alpha = c("a", "b", "c", "d", "e"), 
    val1 = c(1, 999, 3, 8, 999), 
    val2 = c(2, 8, 999, 1, 2))

If I wanted to change val1 so 999 is NA, I could do:

df %>% 
    mutate(val1 = na_if(val1, 999))

In your case, it sounds like you want to replace a value across multiple variables, so using across for multiple columns would be more appropriate:

df %>%
    mutate(across(c(val1, val2), na_if, 999)) # or val1:val2

replaces all instances of 999 in both val1 and val2 with NA and now looks like this:

# A tibble: 5 x 3
  alpha  val1  val2
  <chr> <dbl> <dbl>
1 a        1.    2.
2 b       NA     8.
3 c        3.   NA 
4 d        8.    1.
5 e       NA     2.
camille
  • 16,432
  • 18
  • 38
  • 60
7

I believe the simplest solution is with base R function is.na<-. It's meant to solve precisely that issue.

First, make up some data. Then set the required values to NA.

set.seed(247)    # make the results reproducible

df <- data.frame(X = 1:10, Y = sample(c("-", letters[1:2]), 10, TRUE))

is.na(df) <- df == "-"
df
#    X    Y
#1   1    a
#2   2    b
#3   3    b
#4   4    a
#5   5 <NA>
#6   6    b
#7   7    a
#8   8 <NA>
#9   9    b
#10 10    a
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • 1
    Another way to use base R and is.na could be: `df[df == "-"] <- NA` Notice that this and the one in this post, do not work in case is a list of arrays. the following one does not work: `df[df %in% c("a","b")] <- NA` You will need to separate in different lines the content in the array: `df[df == "a"] <- NA` `df[df == "b"] <- NA` – Corina Roca May 10 '23 at 09:36
1

Here's a solution that will do it:

> library(dplyr)
> test <- tibble(x = c('100', '20.56', '0.003', '-', '  -'),  y = 5:1)
> makeNA <- function(x) str_replace(x,'-',NA_character_)
> mutate_all(test, funs(makeNA))
# A tibble: 5 x 2
  x     y    
  <chr> <chr>
1 100   5    
2 20.56 4    
3 0.003 3    
4 NA    2    
5 NA    1  
mysteRious
  • 4,102
  • 2
  • 16
  • 36