Piping the removal of empty columns using dplyr

Question

I have a data frame of participant questionnaire responses in wide format, with each column representing a particular question/item.

The data frame looks something like this:

id <- c(1, 2, 3, 4)
Q1 <- c(NA, NA, NA, NA)
Q2 <- c(1, "", 4, 5)
Q3 <- c(NA, 2, 3, 4)
Q4 <- c("", "", 2, 2)
Q5 <- c("", "", "", "")
df <- data.frame(id, Q1, Q2, Q3, Q4, Q5)

I want R to remove every column in which all values are either (1) NA or (2) blank. In other words, I want to drop column Q1 (which consists entirely of NAs) and column Q5 (which consists entirely of blanks in the form of "").

According to this thread, I can use the following to remove columns that consist entirely of NAs:

df[, !apply(is.na(df), 2, all)]

However, that solution does not address blanks (""). As I am doing all of this in a dplyr pipe, could someone also explain how I could incorporate the above code into a dplyr pipe?

At this moment, my dplyr pipe looks like the following:

df <- df %>%
    select(relevant columns that I need)

After that I'm stuck, and I'm currently using brackets [] to subset the non-NA columns.

Thanks! Much appreciated.

I've updated my post to reflect what my dplyr pipe looks like right now. — DTYK, Mar 20 '18 at 01:19

Ronak Shah · Accepted Answer · 2019-07-06T04:55:34.853

32

We can use select_if, the predicate-based variant of select:

library(dplyr)
df %>%
   select_if(function(x) !(all(is.na(x)) | all(x=="")))

#  id Q2 Q3 Q4
#1  1  1 NA   
#2  2     2   
#3  3  4  3  2
#4  4  5  4  2
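
One edge case may be worth checking (a sketch, not part of the original answer): if a column holds only a mix of NA and "", then all(is.na(x)) is FALSE and all(x == "") is NA, so the predicate above evaluates to NA rather than TRUE or FALSE. Testing both conditions element-wise avoids this. The data frame df2 and the helper name is_empty_col are hypothetical and used only for illustration.

```r
library(dplyr)

# Hypothetical data: Q_mixed holds only NA and "" values
df2 <- data.frame(id = c(1, 2, 3),
                  Q_mixed = c(NA, "", ""),
                  Q_ok = c("a", "", NA))

# TRUE when every value in the column is either NA or "".
# is.na(x) | x == "" is never NA, because TRUE | NA is TRUE.
is_empty_col <- function(x) all(is.na(x) | x == "")

# Keep only the columns that are not entirely NA/blank
res <- df2 %>% select_if(function(x) !is_empty_col(x))
names(res)
```

Here Q_mixed is dropped while id and Q_ok are kept.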

Or, using the formula shorthand instead of writing out function(x):

df %>% select_if(~!(all(is.na(.)) | all(. == "")))

You can also modify your apply statement as

df[!apply(df, 2, function(x) all(is.na(x)) | all(x==""))]

Or using colSums

df[colSums(is.na(df) | df == "") != nrow(df)]

and its inverse (keep columns with at least one value that is neither NA nor blank):

df[colSums(!(is.na(df) | df == "")) > 0]
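
To see why the colSums versions work, it can help to look at the intermediate counts. The sketch below uses the question's data: the logical matrix marks each cell that is NA or "", and colSums counts those cells per column, so a column is empty exactly when its count equals nrow(df). The variable names empty_cells and n_empty are illustrative only.

```r
df <- data.frame(id = c(1, 2, 3, 4),
                 Q1 = c(NA, NA, NA, NA),
                 Q2 = c(1, "", 4, 5),
                 Q3 = c(NA, 2, 3, 4),
                 Q4 = c("", "", 2, 2),
                 Q5 = c("", "", "", ""))

# Logical matrix: TRUE where a cell is NA or ""
empty_cells <- is.na(df) | df == ""

# Number of NA/blank cells in each column
n_empty <- colSums(empty_cells)
n_empty
# id Q1 Q2 Q3 Q4 Q5
#  0  4  1  1  2  4

# Keep the columns that are not NA/blank in every row
kept <- df[n_empty != nrow(df)]
names(kept)
```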

edited Jul 06 '19 at 04:55

answered Mar 20 '18 at 01:23

Thanks! What's the difference between select and select_if? – DTYK Mar 20 '18 at 01:30

@DTYK `select` expects the names of the columns to select, whereas `select_if` expects a predicate function (or a logical vector) and keeps only the columns for which the result is `TRUE`. – Ronak Shah Mar 20 '18 at 01:33
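
A short illustration of that difference, using the question's data. This is only a sketch; is.numeric is just one example of a predicate, and the names by_name and by_predicate are illustrative.

```r
library(dplyr)

df <- data.frame(id = c(1, 2, 3, 4),
                 Q1 = c(NA, NA, NA, NA),
                 Q2 = c(1, "", 4, 5),
                 Q3 = c(NA, 2, 3, 4),
                 Q4 = c("", "", 2, 2),
                 Q5 = c("", "", "", ""))

# select(): columns are chosen by name
by_name <- df %>% select(id, Q3)

# select_if(): the predicate is applied to each column,
# and only columns for which it returns TRUE are kept
by_predicate <- df %>% select_if(is.numeric)

names(by_name)
names(by_predicate)
```

With this data both calls happen to return id and Q3, since those are the only numeric columns (Q1 is logical; Q2, Q4 and Q5 are character).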

score 16 · Answer 2 · answered Aug 05 '20 at 09:48

With dplyr version 1.0, you can use the helper function where() inside select instead of needing to use select_if.

library(tidyverse)
df <- data.frame(id = c(1, 2, 3, 4),
                 Q1 = c(1, "", 4, 5), 
                 Q2 = c(NA, NA, NA, NA),
                 Q3 = c(NA, 2, 3, 4), 
                 Q4 = c("", "", 2, 2), 
                 Q5 = c("", "", "", ""))

df %>% select(where(~ !(all(is.na(.)) | all(. == ""))))
#>   id Q1 Q3 Q4
#> 1  1  1 NA   
#> 2  2     2   
#> 3  3  4  3  2
#> 4  4  5  4  2
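
Since the question asks how to fit this into an existing pipe, here is a sketch of how the where() step might follow an earlier select(). The first select() stands in for the asker's own column selection; starts_with("Q") is only a placeholder for it, and result is an illustrative name. It assumes dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(id = c(1, 2, 3, 4),
                 Q1 = c(1, "", 4, 5),
                 Q2 = c(NA, NA, NA, NA),
                 Q3 = c(NA, 2, 3, 4),
                 Q4 = c("", "", 2, 2),
                 Q5 = c("", "", "", ""))

result <- df %>%
  # Placeholder for the asker's own selection of relevant columns
  select(id, starts_with("Q")) %>%
  # Then drop the columns that are entirely NA or entirely ""
  select(where(~ !(all(is.na(.x)) | all(.x == ""))))

names(result)
```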

score 5 · Answer 3 · answered Mar 20 '18 at 01:20

You can use select_if to do this.

Method:

col_selector <- function(x) {
  return(!(all(is.na(x)) | all(x == "")))
}


df %>% select_if(col_selector)

Output:

  id Q2 Q3 Q4
1  1  1 NA   
2  2     2   
3  3  4  3  2
4  4  5  4  2

Nik Muhammad Naim