1

I would like to iterate over the columns of a data frame and remove every column in which more than 50% of the entries are NA. So far I have something like this, but it doesn't work:

for (i in names(df_r)) {
    if (sum(is.na(df_r[,i]))/length(df_r) > 0.5) {
        df_r <- df_r[, -i]
        }
    }

I am more of a Python guy and am still learning R, so I might be mixing up the syntax here.

Blazej Kowalski
  • 2
    just `df_r[colMeans(is.na(df_r)) < 0.5]` – Jaap Feb 27 '18 at 10:16
  • 2
    also: please see how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610); that makes it a lot easier for others to answer – Jaap Feb 27 '18 at 10:18

5 Answers

3

Explicit for loops in R are often slower and more verbose than vectorized alternatives, so they are best avoided when a column-wise function will do. In this case, you can use dplyr to make it fast and tidy:

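library(dplyr)

# keep only the columns where at most half of the values are NA;
# in dplyr >= 1.0.0, select(where(...)) is the preferred spelling of select_if()
df_r <- df_r %>% 
  select_if(function(x) mean(is.na(x)) <= 0.5)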
clemens
2

You are much better off using vectorized calculations instead of a literal for loop.

na50 <- sapply(df_r, function(x) sum(is.na(x))) / nrow(df_r)
df_r[na50 > 0.5] <- NULL
# or
df_r <- df_r[na50 <= 0.5]
r2evans
  • Hmm, I modified your solution to `na <- sapply(df_r, function(x) {sum(is.na(x)) / nrow(df_r)})`, `remove2 <- which(na > 0.5)`, `df_r2 <- subset(df_r, select = -remove2)`, because otherwise I got the following error: ``Error in `[.data.table`(x, i, which = TRUE) : i evaluates to a logical vector length 1371 but there are 1179 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.`` – Blazej Kowalski Feb 27 '18 at 10:40
2

I would use lapply to loop over the data.frame columns:

DF <- data.frame(x = c(1, NA, 2), y = c("a", NA, NA))
DF[] <- lapply(DF, function(x) if (mean(is.na(x)) <= 0.5) x else NULL)
#   x
#1  1
#2 NA
#3  2
Roland
0

Check this:

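## for loop solution
half <- nrow(dt) / 2
for (i in names(dt)) {
    # drop the column if more than half of its values are NA
    if (sum(is.na(dt[[i]])) > half) dt[[i]] <- NULL
}

## non for loop solution (assumes dt is a plain data.frame)
na_counts <- colSums(is.na(dt))
cols <- names(na_counts[na_counts > nrow(dt) / 2])
# [[<- accepts only a single column name, so select the columns to keep instead
dt <- dt[setdiff(names(dt), cols)]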
YOLO
0

It's basically one line:

df_r <- df_r[, apply(df_r, MARGIN = 2, FUN = function(x) sum(is.na(x))/length(x) <= 0.5), drop = FALSE]

apply applies the function (specified after FUN =) to each column (specified by MARGIN = 2). The function checks whether the proportion of NAs in a column is smaller than or equal to 0.5 and returns TRUE or FALSE; apply collects these results into a logical vector with one element per column. This vector then selects only the columns of df_r whose NA proportion is at most 0.5.
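
To see what the intermediate logical vector looks like, here is a minimal sketch on a made-up data frame. The names toy, keep and toy_clean are hypothetical and only serve as illustration; the drop = FALSE argument is an addition that keeps the result a data frame even if only one column survives.

```r
# Toy data (hypothetical): a is 25% NA, b is 75% NA, c has no NAs.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, "x"),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# One TRUE/FALSE per column: TRUE where at most half of the entries are NA.
keep <- apply(toy, MARGIN = 2, FUN = function(x) sum(is.na(x)) / length(x) <= 0.5)
print(keep)

# Use the logical vector to select columns; drop = FALSE avoids
# collapsing to a plain vector when a single column remains.
toy_clean <- toy[, keep, drop = FALSE]
print(names(toy_clean))
```

Note that apply first converts the data frame to a matrix, which is harmless for an is.na check but can be slow on wide data; colMeans(is.na(toy)) <= 0.5 should give the same logical vector without the explicit function.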

kath