198

I have a data frame with some numeric columns. Some rows have a 0 value, which should be treated as null in statistical analysis. What is the fastest way to replace all the 0 values with NULL in R?

zx8754
Seen
  • 18
    I don't think you want/can replace with NULL values, but NA serves that purpose in R lingo. – Chase Jun 14 '12 at 16:12

11 Answers

339

Replacing all zeroes with NA:

df[df == 0] <- NA



Explanation

1. NULL is not what you want to replace zeroes with. As it says in ?'NULL',

NULL represents the null object in R

which is unique and, I guess, can be seen as the most uninformative and empty object.[1] It is then not so surprising that

data.frame(x = c(1, NULL, 2))
#   x
# 1 1
# 2 2

That is, R does not reserve any space for this null object.[2] Meanwhile, looking at ?'NA' we see that

NA is a logical constant of length 1 which contains a missing value indicator. NA can be coerced to any other vector type except raw.

Importantly, NA is of length 1 so that R reserves some space for it. E.g.,

data.frame(x = c(1, NA, 2))
#    x
# 1  1
# 2 NA
# 3  2

Also, the data frame structure requires all the columns to have the same number of elements so that there can be no "holes" (i.e., NULL values).

Now you could replace zeroes by NULL in a data frame in the sense of completely removing all the rows containing at least one zero. When using, e.g., var, cov, or cor, that is actually equivalent to first replacing zeroes with NA and setting the value of use as "complete.obs". Typically, however, this is unsatisfactory as it leads to extra information loss.
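The equivalence described above can be checked with a small sketch. The data frame and its column names are made up for illustration; only base R is assumed.

```r
# Toy data (hypothetical): zeros stand for missing measurements
df <- data.frame(x = c(1, 0, 3, 4, 5),
                 y = c(2, 4, 0, 8, 9),
                 z = c(5, 3, 6, 2, 7))

# Variant A: zeros become NA, then cor() drops incomplete rows itself
df_na <- df
df_na[df_na == 0] <- NA
cor_complete <- cor(df_na, use = "complete.obs")

# Variant B: remove every row that contains at least one zero
df_dropped <- df[rowSums(df == 0) == 0, ]
cor_dropped <- cor(df_dropped)

all.equal(cor_complete, cor_dropped)   # both variants agree
nrow(df_dropped)                       # only 3 of the 5 rows survive
```

With use = "pairwise.complete.obs" instead, each pair of columns would keep every row in which both values are present, which is one way to limit the information loss mentioned above.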

2. Instead of running some sort of loop, the solution relies on vectorization: df == 0 returns (try it) a logical matrix of the same size as df, with entries TRUE and FALSE. We are also allowed to pass this matrix to the subsetting operator [...] (see ?'['). Lastly, while the result of df[df == 0] is perfectly intuitive, it may seem strange that df[df == 0] <- NA gives the desired effect. Assignment into a subset does not work this way for every kind of object, but it does for data frames, whose replacement method accepts a logical matrix as the index; see ?'[<-.data.frame'.
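To make the two steps visible, here is a minimal sketch on a made-up data frame (the names df, a, b and mask are only illustrative):

```r
df <- data.frame(a = c(1, 0, 3),
                 b = c(0, 5, 0))

# Step 1: the comparison is vectorized and yields a logical matrix
mask <- df == 0
is.matrix(mask)    # TRUE
dim(mask)          # same shape as df

# Step 2: the data frame replacement method accepts that matrix as index
df[mask] <- NA
df
```

Storing the mask in a variable is optional; df[df == 0] <- NA does the same in one line.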


[1] The empty set in set theory feels somehow related.
[2] Another similarity with set theory: the empty set is a subset of every set, but we do not reserve any space for it.

Julius Vainora
  • 3
    What would the equivalent syntax be for a data.table object? – itpetersen Dec 07 '14 at 05:33
  • 6
    I see you've gotten a lot of votes but do not think this appropriately covers the edge cases of non-numeric columns with values of "0" which were not requested to be set to NA. – IRTFM Dec 16 '14 at 02:57
  • Note to self: If it does not work and the dataframe was parsed from csv, make sure your values don't contain whitespaces at the start/end like " ?". – adroste Mar 14 '22 at 10:59
52

Let me assume that your data.frame is a mix of different datatypes and not all columns need to be modified.

To modify only columns 12 to 18 (of 21 in total), just do this:

df[, 12:18][df[, 12:18] == 0] <- NA
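If the positions of the numeric columns are not known in advance, one possible variation is to select them by type instead of by index. This is only a sketch on a made-up data frame; num_cols is a hypothetical helper name.

```r
df <- data.frame(id = c("a", "0", "c"),
                 x  = c(0, 1, 2),
                 y  = c(3, 0, 0),
                 stringsAsFactors = FALSE)

# Logical vector marking which columns are numeric
num_cols <- vapply(df, is.numeric, logical(1))

# Same idiom as above, restricted to the numeric columns
df[num_cols][df[num_cols] == 0] <- NA
df
```

The character column id keeps its "0" because it is never part of the comparison.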
userJT
42

dplyr::na_if() is an option:

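library(dplyr)

# tibble() replaces the deprecated data_frame()
df <- tibble(col1 = c(1, 2, 3, 0),
             col2 = c(0, 2, 3, 4),
             col3 = c(1, 0, 3, 0),
             col4 = c('a', 'b', 'c', 'd'))

# Works in dplyr < 1.1.0. Newer versions no longer accept a whole
# data frame here, so apply na_if() per column instead, e.g.
# df %>% mutate(across(where(is.numeric), ~ na_if(.x, 0)))
na_if(df, 0)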
# A tibble: 4 x 4
   col1  col2  col3 col4 
  <dbl> <dbl> <dbl> <chr>
1     1    NA     1 a    
2     2     2    NA b    
3     3     3     3 c    
4    NA     4    NA d
sbha
23

An alternative way without the [<- function:

A sample data frame dat (shamelessly copied from @Chase's answer):

dat

  x y
1 0 2
2 1 2
3 1 1
4 2 1
5 0 0

Zeroes can be replaced with NA by the is.na<- function:

is.na(dat) <- !dat


dat

   x  y
1 NA  2
2  1  2
3  1  1
4  2  1
5 NA NA
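Why !dat selects the zeroes: negation coerces numbers to logical, and 0 is the only number that coerces to FALSE. A minimal sketch on a plain vector (x is just an illustrative name):

```r
x <- c(0, 1, 2)

# Negation coerces to logical first, so only the zero becomes TRUE
!x                # TRUE FALSE FALSE

# is.na<- marks the TRUE positions as missing
is.na(x) <- !x
x                 # NA 1 2
```

Since ! is not defined for character vectors, this approach presumably only suits data frames whose columns are all numeric or logical.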
Sven Hohenstein
14
#Sample data
set.seed(1)
dat <- data.frame(x = sample(0:2, 5, TRUE), y = sample(0:2, 5, TRUE))
#-----
  x y
1 0 2
2 1 2
3 1 1
4 2 1
5 0 0

#replace zeros with NA
dat[dat==0] <- NA
#-----
   x  y
1 NA  2
2  1  2
3  1  1
4  2  1
5 NA NA
Chase
13

Because someone asked for the data.table version of this, and because the given data.frame solution does not work with data.table, I am providing the solution below.

Basically, use the := operator: DT[x == 0, x := NA]

library("data.table")

status = as.data.table(occupationalStatus)

head(status, 10)
    origin destination  N
 1:      1           1 50
 2:      2           1 16
 3:      3           1 12
 4:      4           1 11
 5:      5           1  2
 6:      6           1 12
 7:      7           1  0
 8:      8           1  0
 9:      1           2 19
10:      2           2 40


status[N == 0, N := NA]

head(status, 10)
    origin destination  N
 1:      1           1 50
 2:      2           1 16
 3:      3           1 12
 4:      4           1 11
 5:      5           1  2
 6:      6           1 12
 7:      7           1 NA
 8:      8           1 NA
 9:      1           2 19
10:      2           2 40
Reilstein
  • 3
    Or `for (j in names(DT)) set(DT, which(DT[[j]] == 0), j, NA)`. See [here](http://stackoverflow.com/a/7249454/4241780) for a more detailed discussion of using data.table to find and replace values. – JWilliman Nov 22 '16 at 00:24
9

In case anyone arrives here via Google looking for the opposite (i.e. how to replace all NAs in a data.frame with 0), the answer is

df[is.na(df)] <- 0

OR

Using dplyr / tidyverse

library(dplyr)
mtcars %>% replace(is.na(.), 0)
stevec
5

Here is my contribution for those who are struggling with datasets with different types of columns with multiple values representing missing data.

library(dplyr)  # provides tibble(), which replaces the deprecated data_frame()

dat <- tibble(numA = c(1, 0, 3, 4),
              numB = c(NA, 2, 3, 4),
              strC = c("0", "1.2", "NA", "2.4"),
              strD = c("Yes", "Yes", "missing", "No"))

Let's say we want to replace 0 in the numeric columns with NA, as well as the 'NA' and 'missing' values in the character columns. Notice that 'NA' in the strC column is a character value, not the desired NA.

dat
# A tibble: 4 x 4
  numA   numB  strC  strD   
  <dbl>  <dbl> <chr> <chr>  
1     1     NA 0     Yes    
2     0      2 1.2   Yes    
3     3      3 'NA'  missing
4     4      4 2.4   No 

First, an obvious case: when a character column is converted to numeric, any non-numeric string is coerced to NA (with a warning).

as.numeric(dat$strC)
[1] 0.0 1.2  NA 2.4 

Answer with indexing:

dat[dat == "NA" | dat =="missing"] <- NA

However, do NOT use that for 0 because it changes both numeric and character 0s to NA. This is because "0" == 0 returns TRUE in R.
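The coercion behind this can be seen in a short sketch (base R only; the data frame is made up for illustration):

```r
# Comparing a string with a number coerces the number to a string
"0" == 0      # TRUE:  0 becomes "0"
"0.0" == 0    # FALSE: "0.0" is not the string "0"

df <- data.frame(num = c(0, 1),
                 chr = c("0", "x"),
                 stringsAsFactors = FALSE)

# The logical matrix flags the character "0" as well as the numeric 0
hits <- df == 0
hits
```

So the first row is TRUE in both columns, which is exactly why the indexing approach cannot tell the two kinds of zero apart.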

dplyr::na_if method:

library(dplyr)

dat %>%
  lapply(na_if, y = "missing") %>%
  lapply(na_if, y = "NA") %>%
  lapply(na_if, y = 0) %>%  # DONT DO THIS! It converts string 0s to NA as well!
  data.frame()

Here we apply the na_if function to each column of the data. Since na_if does not accept multiple values to convert to NA, we need a separate line for each value. However, using this function with 0 naively converts both the numeric and the character 0s into NA. We need to do something else!

Using mutate across method with na_if function:

This is my favorite solution. Here we check the column type and apply na_if function as necessary. The character 0 is untouched, whereas all desired values are converted into NA.

dat %>%
  mutate(across(where(is.numeric), ~na_if(., 0))) %>%
  mutate(across(where(is.character), ~na_if(., "NA"))) %>%
  mutate(across(where(is.character), ~na_if(., "missing")))

# A tibble: 4 x 4
   numA  numB strC  strD 
  <dbl> <dbl> <chr> <chr>
1     1    NA 0     Yes  
2    NA     2 1.2   Yes  
3     3     3 NA    NA   
4     4     4 2.4   No 

Finally, the naniar package can be used

naniar is a recent package that introduces a variety of replace_with_na_* functions.

library(naniar)

Replace all 'NA' and 'missing' values with NA:

dat %>%
  replace_with_na_all(~.x %in% c("NA", "missing"))

but if you use this with 0s, it still erroneously converts the character 0 to NA:

dat %>%
  replace_with_na_all(~.x %in% c(0, "NA", "missing"))

# A tibble: 4 x 4
   numA  numB strC  strD 
  <dbl> <dbl> <chr> <chr>
1     1    NA NA    Yes  
2    NA     2 1.2   Yes  
3     3     3 NA    NA   
4     4     4 2.4   No
#strC's first element should not be NA here!

So, we have to specify column type using replace_with_na_if:

dat %>%
  replace_with_na_if(is.character, ~.x %in% c("NA", "missing")) %>%
  replace_with_na_if(is.numeric, ~.x %in% c(0))

# A tibble: 4 x 4
   numA  numB strC  strD 
  <dbl> <dbl> <chr> <chr>
1     1    NA 0     Yes  
2    NA     2 1.2   Yes  
3     3     3 NA    NA   
4     4     4 2.4   No

We achieved the desired outcome. I hope all this is helpful :)

MECoskun
4

You can replace 0 with NA only in numeric fields (i.e. excluding things like factors), but it works on a column-by-column basis:

col[col == 0 & is.numeric(col)] <- NA

With a function, you can apply this to your whole data frame:

changetoNA <- function(colnum, df) {
    col <- df[, colnum]
    if (is.numeric(col)) {  # edit: verifying column is numeric
        col[col == 0 & is.numeric(col)] <- NA  # set zeroes to NA
    }
    return(col)
}
df <- data.frame(sapply(1:5, changetoNA, df))

Although you could replace the 1:5 with the number of columns in your data frame, or with 1:ncol(df).
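One caveat with sapply here: it simplifies its result to a matrix, so a data frame with mixed column types can end up coerced to a single type, and the column names are lost. A possible variation, sketched on a made-up data frame with a hypothetical helper zero_to_na, is to assign back with lapply so that types and names are kept:

```r
df <- data.frame(a = c(0, 1, 2),
                 b = c("0", "x", "y"),
                 c = c(5, 0, 0),
                 stringsAsFactors = FALSE)

# Replace zeroes only when the column is numeric; return others unchanged
zero_to_na <- function(col) {
    if (is.numeric(col)) {
        col[col == 0] <- NA
    }
    col
}

# df[] <- ... keeps the data frame structure, names and column types
df[] <- lapply(df, zero_to_na)
str(df)
```

To restrict this to some columns, the same assignment can be applied to a subset, for example df[12:15] <- lapply(df[12:15], zero_to_na).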

Alium Britt
  • I am not sure this is correct solution. What about columns 6 and more. They will get cut. – userJT Feb 19 '15 at 10:44
  • That's why I suggested replacing `1:5` with `1:ncol(df)` at the end. I didn't want to make the equation overly complex or difficult to read. – Alium Britt Feb 19 '15 at 11:55
  • but what if in the columns 6 and 7 - the datatype is char and no replacement should be done. In my problem, I need replacement only in columns 12 to 15 but the whole df has 21 columns (many must not be touched at all). – userJT Feb 20 '15 at 14:05
  • For your data frame you could just change the `1:5` to the column numbers you want changed, like `12:15`, but if you wanted to confirm that it will only affect numeric columns then just wrap the second line of the function in an if statement, like this: `if (is.numeric(col)) { col[col == -1 & is.numeric(col)] <- NA }`. – Alium Britt Feb 20 '15 at 20:23
1

If you are like me and landed here while wondering how to replace ALL values in a dataframe with NA, it's just:

df[,] <- NA
sos_llc
0

Another option is to replace all 0 with NA using mutate_all like this:

library(dplyr)
df <- data.frame(v1 = c(1,0,4,2),
                 v2 = c(3,1,0,0))
df
#>   v1 v2
#> 1  1  3
#> 2  0  1
#> 3  4  0
#> 4  2  0
mutate_all(df, ~replace(., .==0, NA))
#>   v1 v2
#> 1  1  3
#> 2 NA  1
#> 3  4 NA
#> 4  2 NA

Created on 2022-07-10 by the reprex package (v2.0.1)

Quinten