200

I have a data frame where some of the columns contain NA values.

How can I remove columns where all rows contain NA values?

user438383
Gnark

14 Answers

198

Try this:

df <- df[, colSums(is.na(df)) < nrow(df)]
teucer
  • 4
This creates an object the size of the old object, which is a problem with memory on large objects. Better to use a function to reduce the size. The answer below using Filter or using data.table will help your memory usage. – mtelesha Dec 09 '15 at 15:21
  • 4
    This does not appear to work with non-numeric columns. – verbamour Feb 01 '17 at 23:17
  • It changes column name if they are duplicated – Peter.k Jun 26 '18 at 16:42
  • To do this with non-numeric columns, @mnel's solution using Filter() is a good one. A benchmark of multiple approaches can be found [in this post](https://stackoverflow.com/a/61609515/7753022) – jeromeResearch Nov 24 '20 at 17:47
  • 1
    Doesn't seem to work with single-row data frames. – gaspar May 26 '22 at 09:32
  • 2
An alternative might be `df[colSums(!is.na(df)) > 0]` which will return a `data.frame` also in case only 1 column is left and uses `df` only on 2 instead of 3 positions. (Taken from a currently deleted post.) – GKi May 10 '23 at 09:38
  • 1
    @GKi another option is to set `drop = FALSE` to avoid dropping dimensions: `df <- df[, colSums(is.na(df)) < nrow(df), drop = FALSE]` – ismirsehregal May 10 '23 at 09:42
  • @ismirsehregal Yes but the command is longer and df is used on 3 instead of 2 positions. So I would prefer using `df[colSums(!is.na(df)) > 0]` compared to `df[, colSums(is.na(df)) < nrow(df), drop = FALSE]`. If you look in the source in `Filter` it also uses `df[ind]` and not `df[,ind, drop = FALSE]`. – GKi May 10 '23 at 09:49
  • @GKi I wasn't trying to argue that it is the better option. I just wanted to mention the "built-in" way to keep working with a `data.frame` as it is a common pitfall (see @gaspar's comment). Cheers and thanks for your contribution! – ismirsehregal May 10 '23 at 09:58
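The dimension-dropping pitfall raised in the comments (single-row data frames, or when only one column survives the filter) and the `drop = FALSE` fix can be sketched with a tiny example (variable names here are illustrative, not from the answer):

```r
df <- data.frame(a = c(1, 2), b = c(NA, NA))

# Default `[` drops dimensions when only one column remains,
# returning a plain vector instead of a data.frame:
v <- df[, colSums(is.na(df)) < nrow(df)]
is.data.frame(v)   # FALSE

# drop = FALSE keeps the data.frame class:
kept <- df[, colSums(is.na(df)) < nrow(df), drop = FALSE]
is.data.frame(kept)   # TRUE

# The single-bracket (list-style) form avoids the issue as well:
kept2 <- df[colSums(!is.na(df)) > 0]
is.data.frame(kept2)  # TRUE
```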
127

The two approaches offered thus far fail with large data sets as (amongst other memory issues) they create is.na(df), which will be an object the same size as df.

Here are two approaches that are more memory- and time-efficient.

An approach using Filter

Filter(function(x)!all(is.na(x)), df)

and an approach using data.table (for general time and memory efficiency)

library(data.table)
DT <- as.data.table(df)
DT[, which(unlist(lapply(DT, function(x) !all(is.na(x))))), with = FALSE]

examples using large data (30 columns, 1e6 rows)

big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8, NA), 1e6, TRUE), sample(250, 1e6, TRUE)), simplify = FALSE)
bd <- do.call(data.frame, big_data)
names(bd) <- paste0('X', seq_len(30))
DT <- as.data.table(bd)

system.time({df1 <- bd[, colSums(is.na(bd)) < nrow(bd)]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user  system elapsed 
## 0.26    0.03    0.29 
system.time({DT1 <- DT[, which(unlist(lapply(DT, function(x) !all(is.na(x))))), with = FALSE]})
## user  system elapsed 
## 0.14    0.03    0.18 
mnel
  • 7
    Very nice. You could do the same with `data.frame`, though. There's nothing here that really needs `data.table`. The key is the `lapply`, which avoids the copy of the whole object done by `is.na(df)`. +10 for pointing that out. – Matt Dowle Sep 27 '12 at 09:59
  • 1
    How would you do it with a data.frame? @matt-dowle – s_a May 22 '14 at 15:52
  • 10
    @s_a, `bd1 <- bd[, unlist(lapply(bd, function(x), !all(is.na(x))))]` – mnel May 22 '14 at 22:55
  • 6
    @mnel I think you need to remove the `,` after `function(x)` - thanks for the example btw – Thieme Hennis Sep 22 '14 at 08:56
  • 1
    Can you do it faster with := or with a set() ? – skan Jul 18 '16 at 13:50
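As Matt Dowle's comment notes, the same lapply idea works on a plain data.frame without data.table; a minimal base-R sketch (using vapply for a type-checked logical vector; the data here is illustrative):

```r
df <- data.frame(x = 1:3, y = rep(NA, 3), z = c(NA, 2, 3))

# Test each column separately instead of materialising is.na(df)
# for the whole frame at once:
keep <- vapply(df, function(col) !all(is.na(col)), logical(1))
df[keep]   # list-style subsetting keeps the data.frame class
```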
93

Update

You can now use select with the where selection helper. select_if is superseded, but still functional as of dplyr 1.0.2 (thanks to @mcstrother for bringing this to my attention).

library(dplyr)
temp <- data.frame(x = 1:5, y = c(1,2,NA,4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select(where(not_all_na))
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select(where(not_any_na))
  x
1 1
2 2
3 3
4 4
5 5

Old Answer

dplyr now has a select_if verb that may be helpful here:

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select_if(not_all_na)
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select_if(not_any_na)
  x
1 1
2 2
3 3
4 4
5 5
zack
  • 5
    Came here looking for the `dplyr` solution. Was not disappointed. Thanks! – Andrew Brēza Aug 01 '19 at 16:14
I found this had the issue that it would also delete variables with most, but not all, values missing – MBorg May 21 '20 at 13:26
  • 8
    `select_if` is now superseded in dplyr, so the last two lines would be `temp %>% select(where(not_all_na))` in the most recent syntax -- although `select_if` still works for now as of dplyr 1.0.2. Also `temp %>% select(where(~!all(is.na(.x))))` works if you don't feel like defining the function on a separate line. – mcstrother Dec 31 '20 at 21:50
  • 1
    @mcstrother thank you - that is a very helpful update to my answer. If you'd like to answer it yourself I'll happily roll back the edits. – zack Jan 06 '21 at 19:08
  • `not_any_na` is not found for me. where does this come from? I have `dplyr` loaded..... – Sky Scraper Nov 03 '21 at 15:02
  • @SkyScraper it's a function defined in the code provided – zack Nov 04 '21 at 00:59
31

Late to the game, but you can also use the janitor package. This function removes columns that are all NA, and can be changed to remove rows that are all NA as well.

df <- janitor::remove_empty(df, which = "cols")
Moritz Ringler
André.B
18

Another option, using the purrr package:

library(dplyr)

df <- data.frame(a = NA,
                 b = seq(1:5), 
                 c = c(rep(1, 4), NA))

df %>% purrr::discard(~all(is.na(.)))
df %>% purrr::keep(~!all(is.na(.)))
AlexB
17

Another way would be to use the apply() function.

If you have the data.frame

df <- data.frame(var1 = c(1:7, NA),
                 var2 = c(1, 2, 1, 3, 4, NA, NA, 9),
                 var3 = c(NA))

then you can use apply() to see which columns fulfill your condition, and do the same subsetting as in the answer by Musa, only with an apply approach.

> !apply(is.na(df), 2, all)
 var1  var2  var3 
 TRUE  TRUE FALSE 

> df[, !apply(is.na(df), 2, all)]
  var1 var2
1    1    1
2    2    2
3    3    1
4    4    3
5    5    4
6    6   NA
7    7   NA
8   NA    9
mropa
  • 3
    I expected this to be quicker, as the colSum() solution seemed to be doing more work. But on my test set (213 obs. of 1614 variables before, vs. 1377 variables afterwards) it takes exactly 3 times longer. (But +1 for an interesting approach.) – Darren Cook Feb 17 '12 at 12:01
8
df[sapply(df, function(x) all(is.na(x)))] <- NULL
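A quick check of this one-liner on sample data (the data frame here is illustrative; note the assignment modifies df in place by list-style deletion):

```r
df <- data.frame(a = 1:3, b = rep(NA, 3), c = c(1, NA, 3))

# Assigning NULL to the columns selected by the logical vector
# removes them from the data.frame:
df[sapply(df, function(x) all(is.na(x)))] <- NULL
names(df)   # "a" "c"
```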
jpmorris
7

An old question, but I think we can update @mnel's nice answer with a simpler data.table solution:

DT[, .SD, .SDcols = \(x) !all(is.na(x))]

(I'm using the new \(x) lambda function syntax available in R >= 4.1, but really the key thing is to pass the logical subsetting through .SDcols.)

Speed is equivalent.

microbenchmark::microbenchmark(
  which_unlist = DT[, which(unlist(lapply(DT, \(x) !all(is.na(x))))), with=FALSE],
  sdcols       = DT[, .SD, .SDcols = \(x) !all(is.na(x))],
  times = 2
)
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq      max neval cld
#>  which_unlist 51.32227 51.32227 56.78501 56.78501 62.24776 62.24776     2   a
#>        sdcols 43.14361 43.14361 49.33491 49.33491 55.52621 55.52621     2   a
Moritz Ringler
Grant
4

You can use the janitor package's remove_empty():

library(janitor)

df %>%
  remove_empty(c("rows", "cols")) #select either row or cols or both

Another dplyr approach (note that these keep only columns with no NA values at all, so they also drop partially missing columns):

library(dplyr)
df %>% select_if(~all(!is.na(.)))

OR

df %>% select_if(colSums(!is.na(.)) == nrow(df))

This is also useful if you want to exclude or keep only columns with a certain number of non-missing values, e.g.

df %>% select_if(colSums(!is.na(.)) > 500)
ok1more
1

I hope this may also help. It could be made into a single command, but I found it easier to read split into two commands. I made a function with the following instructions and it worked lightning fast.

library(data.table)

naColsRemoval <- function(DataTable) {
  na.cols <- DataTable[, .(which(apply(is.na(.SD), 2, all)))]
  DataTable[, (unlist(na.cols)) := NULL]
}

.SD allows you to limit the check to part of the table if you wish, but it will take the whole table by default.

Moritz Ringler
1

A handy base R option could be colMeans():

df[, colMeans(is.na(df)) != 1]
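Since colMeans(is.na(df)) is the fraction of missing values per column, the same expression also generalizes to thresholds; a small sketch (data and the 0.5 cutoff are illustrative):

```r
df <- data.frame(a = 1:4, b = NA, c = c(1, NA, NA, 4))

colMeans(is.na(df))   # fraction of NAs per column: a = 0, b = 1, c = 0.5

# Drop only the columns that are entirely NA:
df[, colMeans(is.na(df)) != 1, drop = FALSE]

# The same idea supports thresholds, e.g. keep columns
# that are less than half missing:
df[, colMeans(is.na(df)) < 0.5, drop = FALSE]
```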
tmfmnk
0

From my experience of having trouble applying previous answers, I found that I needed to modify their approach to achieve what the question asks:

How to get rid of columns where for ALL rows the value is NA?

First, note that my solution will only work if you do not have duplicate columns (that issue is dealt with here, on Stack Overflow).

Second, it uses dplyr.

Instead of

df <- df %>% select_if(~all(!is.na(.)))

I find that what works is

df <- df %>% select_if(~!all(is.na(.)))

The point is that the negation operator ! needs to be outside the universal quantifier: the select_if operator acts on columns, and here it selects only those that do not satisfy the criterion

every element is equal to NA
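The difference between the two predicates can be checked directly on single columns; a small base-R sketch (the column values are illustrative):

```r
col_mixed  <- c(1, NA, 3)
col_all_na <- c(NA, NA, NA)

# all(!is.na(x)) is TRUE only when the column has no NAs at all,
# so a partially missing column would be dropped:
all(!is.na(col_mixed))    # FALSE

# !all(is.na(x)) is TRUE unless every value is NA,
# so the partially missing column is kept:
!all(is.na(col_mixed))    # TRUE
!all(is.na(col_all_na))   # FALSE
```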

0
janitor::remove_constant() 

does this very nicely.

Moritz Ringler
  • 1
    janitor::remove_empty() would be more appropriate here. ?remove_empty = "Remove empty rows and/or columns from a data.frame or matrix" – Thomas Moore Aug 14 '21 at 01:29
0
library(dplyr)

# create a sample data frame
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(NA, NA, NA, NA),
                 z = c(6, 7, NA, 9))

# remove columns with all NAs
df <- df %>%
  select_if(~!all(is.na(.)))
Airdjan