How can I most efficiently set 0 vals to NA in a subset of columns?

Question

I have a book on statistics (using R) showing the following:

> pima$diastolic [pima$diastolic == 0] <- NA
> pima$glucose [pima$glucose == 0] <- NA
> pima$triceps [pima$triceps == 0] <- NA
> pima$insulin [pima$insulin == 0] <- NA
> pima$bmi [pima$bmi == 0] <- NA

Is there a way to do it in one line or more efficiently? I see there are functions such as with, apply, subset for doing similar stuff but could not figure out how to put them together...

Sample data (how do I read this in as a data frame, like Python's StringIO?):

  pregnant glucose diastolic triceps insulin  bmi diabetes age     test
1        6     148        72      35       0 33.6    0.627  50 positive
2        1      85        66      29       0 26.6    0.351  31 negative
3        8     183        64       0       0 23.3    0.672  32 positive
4        1      89        66      23      94 28.1    0.167  21 negative
5        0     137        40      35     168 43.1    2.288  33 positive
6        5     116        74       0       0 25.6    0.201  30 negative
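
On the parenthetical question: base R's `read.table()` accepts a `text` argument, which plays roughly the role of Python's `StringIO`. A minimal sketch using the six sample rows above (the name `pima_txt` is only for illustration):

```r
# The sample rows as one string. The header line has 9 fields and each
# data line has 10, so read.table() uses the first column as row names.
pima_txt <- "  pregnant glucose diastolic triceps insulin  bmi diabetes age     test
1        6     148        72      35       0 33.6    0.627  50 positive
2        1      85        66      29       0 26.6    0.351  31 negative
3        8     183        64       0       0 23.3    0.672  32 positive
4        1      89        66      23      94 28.1    0.167  21 negative
5        0     137        40      35     168 43.1    2.288  33 positive
6        5     116        74       0       0 25.6    0.201  30 negative"

# Parse the string as if it were a whitespace-delimited file.
pima <- read.table(text = pima_txt, header = TRUE)

# Inspect the column types that were inferred.
str(pima)
```

The result should be a data frame with 6 rows and 9 columns, which is enough to try the one-liners on.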

If you want to replace all `0`s in a data frame with `NA`, you can easily do it as: `df[df == 0] <- NA` – DatamineR, Apr 06 '16 at 10:32

score 7 · Accepted Answer · edited Apr 06 '16 at 11:06


Something like this:

Use lapply() to apply a function to every column.
In the function, test whether the column is numeric. If it is, replace zeros with NA; otherwise return the original column unchanged:

Try this:

pima[] <- lapply(pima, function(x) { if (is.numeric(x)) x[x == 0] <- NA; x })

Or for predefined columns

cols = c("diastolic", "glucose", "triceps", "insulin", "bmi")
pima[cols] <- lapply(pima[cols], function(x) {x[x==0] <- NA ; x})

Or using is.na<-

is.na(pima[cols]) <- pima[cols] == 0
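
To see what the column-subset variant does, here is a small self-contained sketch. The data frame is a cut-down stand-in for `pima` (three of the sample rows, four columns), built only for illustration:

```r
# A small stand-in for pima: sample rows 1, 5 and 6, four columns.
pima <- data.frame(
  pregnant = c(6, 0, 5),
  glucose  = c(148, 137, 116),
  triceps  = c(35, 35, 0),
  insulin  = c(0, 168, 0)
)

# Only these columns should have their zeros treated as missing.
cols <- c("glucose", "triceps", "insulin")

# Mark every zero in the chosen columns as NA in a single assignment.
is.na(pima[cols]) <- pima[cols] == 0

# pregnant is not in cols, so its legitimate zero is left alone.
print(pima)
```

Restricting the replacement to `cols` matters here because a zero in `pregnant` is a real value rather than a missing measurement.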

edited Apr 06 '16 at 11:06 by David Arenburg

answered Apr 06 '16 at 10:38 by Andrie

This will probably return a list instead of a `data.frame`. You probably will need `pima[] <-...`. And `lapply` probably isn't necessary here at all (unless you are trying to avoid matrix conversions). – David Arenburg Apr 06 '16 at 10:40
@DavidArenburg You're correct about the `pima[] <-`. I edited my answer, thanks. You're probably also correct about not needing `lapply()` but life is too short to remember every possible method that you can use directly on a data frame. I know that `lapply()` works for this type of problem, so I tend to use it this way in my own work... – Andrie Apr 06 '16 at 10:43
This is not a subset of the columns though, it is all of them afaics... – The Unfun Cat Apr 06 '16 at 10:46

score 0 · Answer 2 · edited May 23 '17 at 12:24


Using data.table (with pima converted to a data.table first, e.g. via setDT(pima)), you can try:

for (col in c("diastolic","glucose","triceps","insulin", "bmi")) pima[(get(col))==0, (col) := NA]

More details here: How to replace NA values in a table *for selected columns*? data.frame, data.table

edited May 23 '17 at 12:24 by Community

answered Apr 06 '16 at 10:38 by pauljeba

score 0 · Answer 3 · answered Apr 06 '16 at 10:57

Using dplyr, you could do:

library(dplyr)

# banal function definition
zero_to_NA <- function(col) {
    # any code that works here
    # I chose this because it is concise and efficient
    `is.na<-`(col, col==0)
}

# Assuming you want to change 0 to NA only in these 3 columns
pima <- pima %>% 
    mutate_each(funs(zero_to_NA), diastolic, glucose, triceps)

Or you could skip the function definition and write directly:

pima <- pima %>% 
    mutate_each(funs(`is.na<-`(., .==0)), 
                diastolic, glucose, triceps)
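
Note that `mutate_each()` and `funs()` have since been deprecated in dplyr. Assuming dplyr 1.0.0 or later, a rough equivalent can be sketched with `across()` and `na_if()`; the small data frame below is a stand-in for `pima`, built only for illustration:

```r
library(dplyr)

# A small stand-in for pima: sample rows 1, 5 and 6, four columns.
pima <- data.frame(
  pregnant  = c(6, 0, 5),
  glucose   = c(148, 137, 116),
  diastolic = c(72, 40, 74),
  triceps   = c(35, 35, 0)
)

# na_if(x, 0) returns x with every 0 replaced by NA;
# across() applies it to the selected columns only.
pima <- pima %>%
  mutate(across(c(diastolic, glucose, triceps), ~ na_if(.x, 0)))

print(pima)
```

As with the other approaches, columns that are not listed (here `pregnant`) keep their zeros.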
