Code to filter out problematic 0 and NA entries

Question

Total newbie here, I totally apologize if/when at any point I sound like a complete idiot.

I am working in RStudio. I have imported a data file from Excel. It has several columns with health information such as age, blood pressure, BMI, and a couple of others. I need to remove the entries with 0s in a couple of the columns (you can't have 0 BMI or blood pressure). I also need to remove all of the entries with NAs.

I am stuck on what to do. I have tried the na.omit function, but afterwards, when I try things like mean() and median(), I get the message "argument is not numeric or logical: returning NA", which makes no sense to me. I thought the NAs were supposed to be removed.

Please help. I need help cleaning this data.

Please see this post: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example - B Williams, Sep 13 '18 at 20:34
Here is a great resource for starting with R: http://r4ds.had.co.nz/index.html. As per the guidelines for posting, please include sample code showing how far you were able to get. - djchapman, Sep 13 '18 at 20:35
If a row has BMI == 0 but blood_pressure != 0 (or vice versa), or BMI is NA but blood_pressure is not (or vice versa), are you going to remove it? - s__, Sep 13 '18 at 20:36

score 0 · Answer 1 · answered Sep 13 '18 at 20:38

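Dropping every row that contains an NA is usually not a good idea: a row may be NA in one column but have valid values in the others, so you may discard data you still need.

complete.cases(df), from the stats package (loaded by default), returns a logical vector that is TRUE for rows with no NA. Use it to subset the data frame, e.g. df[complete.cases(df), ], to remove every row that contains an NA.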
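As a minimal sketch of how complete.cases() behaves, here is a toy data frame. The column names and values are made up for illustration and are not the asker's data. It also shows how to limit the check to the columns you care about, so rows are not dropped for an NA in an unrelated column.

```r
# Hypothetical toy data; column names are assumptions, not the asker's file
df <- data.frame(
  age            = c(34, 51, NA, 47),
  BMI            = c(22.5, NA, 30.1, 27.8),
  blood_pressure = c(120, 135, 118, 142)
)

# complete.cases() returns one logical per row: TRUE means the row has no NA
keep <- complete.cases(df)

# Subsetting the rows with that vector is what actually drops incomplete rows
df_complete <- df[keep, ]

# Check only selected columns so an NA in 'age' does not remove the row
df_bmi_bp <- df[complete.cases(df[, c("BMI", "blood_pressure")]), ]
```

With this toy data, df_complete keeps the first and last rows, while df_bmi_bp also keeps the row whose only missing value is age.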

To change 0s to NA across the whole data frame, you can do:

df[ df == 0] <- NA
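The line above touches every column, including ones where 0 may be a valid value. A hedged variant that restricts the replacement to chosen columns could look like the sketch below. The data frame, the column names, and the zero_invalid vector are all hypothetical.

```r
# Hypothetical toy data; 'num_children' stands in for a column where 0 is valid
df <- data.frame(
  num_children   = c(0, 2, 1, 0),
  BMI            = c(22.5, 0, 30.1, NA),
  blood_pressure = c(0, 135, 118, 142)
)

# Columns in which a 0 cannot be a real measurement (an assumption for this sketch)
zero_invalid <- c("BMI", "blood_pressure")

# Replace 0 with NA in those columns only; which() skips existing NAs safely
df[zero_invalid] <- lapply(df[zero_invalid], function(x) {
  replace(x, which(x == 0), NA)
})
```

After this runs, num_children still contains its zeros, while the zeros in BMI and blood_pressure have become NA.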

Also, if you want to ignore NAs while doing calculations, you can do:

median(df$col, na.rm = TRUE)

This drops the NAs for that calculation only, so the result is no longer NA; df itself is left unchanged.

score 0 · Answer 2 · answered Sep 13 '18 at 20:40

A tidyverse solution might look like this. The tidyverse is a set of packages developed by the RStudio team.

library(tidyverse)

data <- data %>%
  # Comparing with NA (col != NA) always yields NA, so test with is.na() instead
  filter(BMI != 0, BloodPressure != 0, !is.na(col))
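One thing worth checking in a filter like this is how NA behaves in comparisons. Any comparison against NA evaluates to NA rather than TRUE or FALSE, and dplyr's filter() keeps only rows where the condition is TRUE, so missing values are normally tested with is.na(). The base R sketch below uses a made-up vector to show the difference.

```r
# Hypothetical BMI-like values containing a zero and a missing entry
x <- c(18.5, 0, NA, 27.3)

# Comparing against NA never gives TRUE or FALSE: every element is NA
cmp <- x != NA

# is.na() is the reliable test for missing values
missing <- is.na(x)

# Keep values that are neither missing nor zero
keep <- !is.na(x) & x != 0
clean <- x[keep]
```

Here cmp is NA in every position, which is why a condition written as a comparison with NA would keep no rows at all.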

answered Sep 13 '18 at 20:40

djchapman

205
1
9

score 0 · Answer 3 · answered Sep 13 '18 at 20:46

First of all, make sure that the columns you are interested in are numeric, not character, because importing directly from Excel files can produce unexpected column types. To check, use class(data_name$column_name).

mean() and median() cannot handle character variables (this is what triggers the "argument is not numeric or logical" message), so you have to convert them to numeric first using

data_name$column_name <- as.numeric(data_name$column_name)
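To illustrate the conversion step, here is a small sketch with a made-up character vector standing in for an imported column. It shows the mean() result on character data, the conversion, and a way to spot entries that turned into NA only because they could not be parsed. If the column is a factor rather than character, converting through as.character() first is the usual approach.

```r
# Hypothetical column as it might arrive from a spreadsheet import
raw <- c("23.4", "0", "N/A", "31.2")
col_class <- class(raw)

# mean() on character data warns and returns NA
bad_mean <- suppressWarnings(mean(raw))

# Convert to numeric; unparseable strings become NA (with a warning)
converted <- suppressWarnings(as.numeric(raw))

# Entries that were present but could not be converted
unparsed <- raw[is.na(converted) & !is.na(raw)]

# For a factor, go through as.character() to get the labels, not the level codes
f <- factor(c("23.4", "31.2"))
f_numeric <- as.numeric(as.character(f))
```

Inspecting unparsed before moving on helps distinguish values that were genuinely missing from text such as "N/A" that the conversion silently turned into NA.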

After that, you can replace zeros with NA using the ifelse() function:

data_name$column_name <- ifelse(data_name$column_name == 0, NA, data_name$column_name)

Then you can compute the mean and median as usual, passing na.rm = TRUE so that missing values (NA) are ignored:

mean_BMI <- mean(data_name$BMI, na.rm = TRUE)
