200

I have a data frame where some of the columns contain NA values.

How can I remove columns where all rows contain NA values?

user438383
Gnark

14 Answers

198

Try this:

df <- df[, colSums(is.na(df)) < nrow(df)]
teucer
  • 4
This creates an object the size of the old object, which is a problem with memory on large objects. Better to use a function to reduce the size. The answer below using Filter or using data.table will help your memory usage. – mtelesha Dec 09 '15 at 15:21
  • 4
    This does not appear to work with non-numeric columns. – verbamour Feb 01 '17 at 23:17
  • It changes column name if they are duplicated – Peter.k Jun 26 '18 at 16:42
  • To do this with non-numeric columns, @mnel's solution using Filter() is a good one. A benchmark of multiple approaches can be found [in this post](https://stackoverflow.com/a/61609515/7753022) – jeromeResearch Nov 24 '20 at 17:47
  • 1
    Doesn't seem to work with single-row data frames. – gaspar May 26 '22 at 09:32
  • 2
An alternative might be `df[colSums(!is.na(df)) > 0]` which will return a `data.frame` also in case only 1 column is left and uses `df` only on 2 instead of 3 positions. (Taken from a currently deleted post.) – GKi May 10 '23 at 09:38
  • 1
    @GKi another option is to set `drop = FALSE` to avoid dropping dimensions: `df <- df[, colSums(is.na(df)) < nrow(df), drop = FALSE]` – ismirsehregal May 10 '23 at 09:42
  • @ismirsehregal Yes but the command is longer and df is used on 3 instead of 2 positions. So I would prefer using `df[colSums(!is.na(df)) > 0]` compared to `df[, colSums(is.na(df)) < nrow(df), drop = FALSE]`. If you look in the source in `Filter` it also uses `df[ind]` and not `df[,ind, drop = FALSE]`. – GKi May 10 '23 at 09:49
  • @GKi I wasn't trying to argue that it is the better option. I just wanted to mention the "built-in" way to keep working with a `data.frame` as it is a common pitfall (see @gaspar's comment). Cheers and thanks for your contribution! – ismirsehregal May 10 '23 at 09:58
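The dimension-dropping pitfall raised in the comments (single-row data frames, or when only one column survives the filter) and the `drop = FALSE` fix can be sketched with a tiny example (variable names here are illustrative, not from the answer):

```r
df <- data.frame(a = c(1, 2), b = c(NA, NA))

# Default `[` drops dimensions when only one column remains,
# returning a plain vector instead of a data.frame:
v <- df[, colSums(is.na(df)) < nrow(df)]
is.data.frame(v)   # FALSE

# drop = FALSE keeps the data.frame class:
kept <- df[, colSums(is.na(df)) < nrow(df), drop = FALSE]
is.data.frame(kept)   # TRUE

# The single-bracket (list-style) form avoids the issue as well:
kept2 <- df[colSums(!is.na(df)) > 0]
is.data.frame(kept2)  # TRUE
```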
127

The two approaches offered thus far fail with large data sets as (amongst other memory issues) they create is.na(df), which will be an object the same size as df.

Here are two approaches that are more memory- and time-efficient.

An approach using Filter

Filter(function(x)!all(is.na(x)), df)

and an approach using data.table (for general time and memory efficiency)

library(data.table)
DT <- as.data.table(df)
DT[, which(unlist(lapply(DT, function(x) !all(is.na(x))))), with = FALSE]

examples using large data (30 columns, 1e6 rows)

big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8, NA), 1e6, TRUE), sample(250, 1e6, TRUE)), simplify = FALSE)
bd <- do.call(data.frame, big_data)
names(bd) <- paste0('X', seq_len(30))
DT <- as.data.table(bd)

system.time({df1 <- bd[, colSums(is.na(bd)) < nrow(bd)]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user  system elapsed 
## 0.26    0.03    0.29 
system.time({DT1 <- DT[, which(unlist(lapply(DT, function(x) !all(is.na(x))))), with = FALSE]})
## user  system elapsed 
## 0.14    0.03    0.18 
mnel
  • 7
    Very nice. You could do the same with `data.frame`, though. There's nothing here that really needs `data.table`. The key is the `lapply`, which avoids the copy of the whole object done by `is.na(df)`. +10 for pointing that out. – Matt Dowle Sep 27 '12 at 09:59
  • 1
    How would you do it with a data.frame? @matt-dowle – s_a May 22 '14 at 15:52
  • 10
    @s_a, `bd1 <- bd[, unlist(lapply(bd, function(x), !all(is.na(x))))]` – mnel May 22 '14 at 22:55
  • 6
    @mnel I think you need to remove the `,` after `function(x)` - thanks for the example btw – Thieme Hennis Sep 22 '14 at 08:56
  • 1
    Can you do it faster with := or with a set() ? – skan Jul 18 '16 at 13:50
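As Matt Dowle's comment notes, the same lapply idea works on a plain data.frame without data.table; a minimal base-R sketch (using vapply for a type-checked logical vector; the data here is illustrative):

```r
df <- data.frame(x = 1:3, y = rep(NA, 3), z = c(NA, 2, 3))

# Test each column separately instead of materialising is.na(df)
# for the whole frame at once:
keep <- vapply(df, function(col) !all(is.na(col)), logical(1))
df[keep]   # list-style subsetting keeps the data.frame class
```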
93

Update

You can now use select with the where selection helper. select_if is superseded, but still functional as of dplyr 1.0.2 (thanks to @mcstrother for bringing this to my attention).

library(dplyr)
temp <- data.frame(x = 1:5, y = c(1,2,NA,4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select(where(not_all_na))
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select(where(not_any_na))
  x
1 1
2 2
3 3
4 4
5 5

Old Answer

dplyr now has a select_if verb that may be helpful here:

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select_if(not_all_na)
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select_if(not_any_na)
  x
1 1
2 2
3 3
4 4
5 5
zack
  • 5
    Came here looking for the `dplyr` solution. Was not disappointed. Thanks! – Andrew Brēza Aug 01 '19 at 16:14
I found this had the issue that it would also delete variables with most, but not all, values missing – MBorg May 21 '20 at 13:26
  • 8
    `select_if` is now superseded in dplyr, so the last two lines would be `temp %>% select(where(not_all_na))` in the most recent syntax -- although `select_if` still works for now as of dplyr 1.0.2. Also `temp %>% select(where(~!all(is.na(.x))))` works if you don't feel like defining the function on a separate line. – mcstrother Dec 31 '20 at 21:50
  • 1
    @mcstrother thank you - that is a very helpful update to my answer. If you'd like to answer it yourself I'll happily roll back the edits. – zack Jan 06 '21 at 19:08
  • `not_any_na` is not found for me. where does this come from? I have `dplyr` loaded..... – Sky Scraper Nov 03 '21 at 15:02
  • @SkyScraper it's a function defined in the code provided – zack Nov 04 '21 at 00:59
31

Late to the game, but you can also use the janitor package. This function removes columns that are all NA, and can be changed to remove rows that are all NA as well.

df <- janitor::remove_empty(df, which = "cols")
Moritz Ringler
André.B
18

Another option, using the purrr package:

library(dplyr)

df <- data.frame(a = NA,
                 b = seq(1:5), 
                 c = c(rep(1, 4), NA))

df %>% purrr::discard(~all(is.na(.)))
df %>% purrr::keep(~!all(is.na(.)))
AlexB
17

Another way would be to use the apply() function.

If you have the data.frame

df <- data.frame(var1 = c(1:7, NA),
                 var2 = c(1, 2, 1, 3, 4, NA, NA, 9),
                 var3 = c(NA))

then you can use apply() to see which columns fulfill your condition, and do the same subsetting as in the answer by Musa, only with an apply approach.

> !apply(is.na(df), 2, all)
 var1  var2  var3 
 TRUE  TRUE FALSE 

> df[, !apply(is.na(df), 2, all)]
  var1 var2
1    1    1
2    2    2
3    3    1
4    4    3
5    5    4
6    6   NA
7    7   NA
8   NA    9
mropa
  • 3
    I expected this to be quicker, as the colSum() solution seemed to be doing more work. But on my test set (213 obs. of 1614 variables before, vs. 1377 variables afterwards) it takes exactly 3 times longer. (But +1 for an interesting approach.) – Darren Cook Feb 17 '12 at 12:01
8
df[sapply(df, function(x) all(is.na(x)))] <- NULL
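A quick check of this one-liner on sample data (the data frame here is illustrative; note the assignment modifies df in place by list-style deletion):

```r
df <- data.frame(a = 1:3, b = rep(NA, 3), c = c(1, NA, 3))

# Assigning NULL to the columns selected by the logical vector
# removes them from the data.frame:
df[sapply(df, function(x) all(is.na(x)))] <- NULL
names(df)   # "a" "c"
```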
jpmorris
7

An old question, but I think we can update @mnel's nice answer with a simpler data.table solution:

DT[, .SD, .SDcols = \(x) !all(is.na(x))]

(I'm using the new \(x) lambda function syntax available in R >= 4.1, but really the key thing is to pass the logical subsetting through .SDcols.)

Speed is equivalent.

microbenchmark::microbenchmark(
  which_unlist = DT[, which(unlist(lapply(DT, \(x) !all(is.na(x))))), with=FALSE],
  sdcols       = DT[, .SD, .SDcols = \(x) !all(is.na(x))],
  times = 2
)
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq      max neval cld
#>  which_unlist 51.32227 51.32227 56.78501 56.78501 62.24776 62.24776     2   a
#>        sdcols 43.14361 43.14361 49.33491 49.33491 55.52621 55.52621     2   a
Moritz Ringler
Grant
4

You can use the janitor package's remove_empty():

library(janitor)

df %>%
  remove_empty(c("rows", "cols")) #select either row or cols or both

Another dplyr approach (note that these keep only columns with no NA values at all, so they also drop partially missing columns):

library(dplyr)
df %>% select_if(~all(!is.na(.)))

OR

df %>% select_if(colSums(!is.na(.)) == nrow(df))

This is also useful if you want to exclude or keep only columns with a certain number of non-missing values, e.g.

df %>% select_if(colSums(!is.na(.)) > 500)
ok1more
1

I hope this may also help. It could be made into a single command, but I found it easier to read split into two commands. I made a function with the following instructions and it worked lightning fast.

library(data.table)

naColsRemoval <- function(DataTable) {
  na.cols <- DataTable[, .(which(apply(is.na(.SD), 2, all)))]
  DataTable[, (unlist(na.cols)) := NULL]
}

.SD allows you to limit the check to part of the table if you wish, but it will take the whole table by default.

Moritz Ringler
1

A handy base R option could be colMeans():

df[, colMeans(is.na(df)) != 1]
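Since colMeans(is.na(df)) is the fraction of missing values per column, the same expression also generalizes to thresholds; a small sketch (data and the 0.5 cutoff are illustrative):

```r
df <- data.frame(a = 1:4, b = NA, c = c(1, NA, NA, 4))

colMeans(is.na(df))   # fraction of NAs per column: a = 0, b = 1, c = 0.5

# Drop only the columns that are entirely NA:
df[, colMeans(is.na(df)) != 1, drop = FALSE]

# The same idea supports thresholds, e.g. keep columns
# that are less than half missing:
df[, colMeans(is.na(df)) < 0.5, drop = FALSE]
```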
tmfmnk
0

From my experience of having trouble applying previous answers, I found that I needed to modify their approach to achieve what the question asks:

How to get rid of columns where for ALL rows the value is NA?

First, note that my solution will only work if you do not have duplicate columns (that issue is dealt with here, on Stack Overflow).

Second, it uses dplyr.

Instead of

df <- df %>% select_if(~all(!is.na(.)))

I find that what works is

df <- df %>% select_if(~!all(is.na(.)))

The point is that the negation operator ! needs to be outside the universal quantifier: the select_if operator acts on columns, and here it selects only those that do not satisfy the criterion

every element is equal to NA
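The difference between the two predicates can be checked directly on single columns; a small base-R sketch (the column values are illustrative):

```r
col_mixed  <- c(1, NA, 3)
col_all_na <- c(NA, NA, NA)

# all(!is.na(x)) is TRUE only when the column has no NAs at all,
# so a partially missing column would be dropped:
all(!is.na(col_mixed))    # FALSE

# !all(is.na(x)) is TRUE unless every value is NA,
# so the partially missing column is kept:
!all(is.na(col_mixed))    # TRUE
!all(is.na(col_all_na))   # FALSE
```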

0
janitor::remove_constant() 

does this very nicely.

Moritz Ringler
  • 1
    janitor::remove_empty() would be more appropriate here. ?remove_empty = "Remove empty rows and/or columns from a data.frame or matrix" – Thomas Moore Aug 14 '21 at 01:29
0
library(dplyr)

# create a sample data frame
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(NA, NA, NA, NA),
                 z = c(6, 7, NA, 9))

# remove columns with all NAs
df <- df %>%
  select_if(~!all(is.na(.)))
Airdjan