
I'm having some issues with a seemingly simple task: removing, with dplyr, all rows where all variables are NA. I know it can be done using base R (see Remove rows in R matrix where all data is NA and Removing empty rows of a data file in R), but I'm curious whether there is a simple way of doing it with dplyr.

Example:

library(tidyverse)
dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))
filter(dat, !is.na(a) | !is.na(b) | !is.na(c))

The filter call above does what I want, but it's infeasible in the situation I'm facing, as there is a large number of variables. I guess one could do it by using `filter_()` and first creating a string with the (long) logical statement, but it seems like there should be a simpler way.

Another way is to use rowwise() and do():

na <- dat %>% 
  rowwise() %>% 
  do(tibble(na = !all(is.na(.)))) %>% 
  .$na
filter(dat, na)

but that does not look too nice, although it gets the job done. Other ideas?

user438383
hejseb
    Could do `dat %>% filter(rowSums(is.na(.)) != ncol(.))` perhaps or `dat %>% filter(rowMeans(is.na(.)) < 1)` – David Arenburg Jan 12 '17 at 10:31
  • Or ```dat %>% filter(Reduce(`+`, lapply(., is.na)) != ncol(.))``` which are all basically implementations of base R combined with `filter`. I don't think there is anything built-in in `tidyverse` to do this very efficiently – David Arenburg Jan 12 '17 at 10:37
  • Though, Hadley probably would recommend working on a long format, something like `dat %>% mutate(indx = row_number()) %>% gather(var, val, -indx) %>% group_by(indx) %>% filter(sum(is.na(val)) != n()) %>% spread(var, val)` – David Arenburg Jan 12 '17 at 10:46
  • @DavidArenburg Thanks, these are nice suggestions! I had a feeling there would be some kind of `all` function, like `filter(dat, !all_na())`, but I would guess not. – hejseb Jan 12 '17 at 10:47
  • If there isn't one yet, there probably will be someday. – David Arenburg Jan 12 '17 at 10:48
  • @DavidArenburg That would be nice. I did a quick benchmarking out of curiosity of the different approaches and added it to the post. – hejseb Jan 12 '17 at 10:59
  • You could post that as an answer and benchmark on a bit bigger data set. I would guess the `Reduce` approach will become less efficient if there are many columns. – David Arenburg Jan 12 '17 at 11:09
  • @DavidArenburg I went with 40 now and around 100,000 rows, but it's still doing well! – hejseb Jan 12 '17 at 11:17
  • Welp, I guess converting a data.frame to a matrix is more expensive than looping over a list. – David Arenburg Jan 12 '17 at 11:20

10 Answers


Since dplyr 0.7.0, new scoped filtering verbs exist. Using `filter_all()` together with `any_vars()` you can easily keep the rows with at least one non-missing column:

# dplyr 0.7.0
dat %>% filter_all(any_vars(!is.na(.)))

Using @hejseb's benchmarking code, this solution appears to be about as efficient as f4.

UPDATE:

Since dplyr 1.0.0 the scoped verbs above have been superseded. Instead, the `across()` function family was introduced, which allows a function to be applied to multiple (or all) columns. Filtering rows with at least one non-NA column now looks like this:

# dplyr >= 1.0.4 (if_any() was added in 1.0.4)
dat %>% filter(if_any(everything(), ~ !is.na(.)))
MarkusN
  • This is the most intuitive solution to remove the all-NA rows, in my opinion. In addition, it is worth mentioning that for the positive case, when you want to detect the all-NA rows, you must use `all_vars()` instead of `any_vars()`, as in `dat %>% filter_all(all_vars(is.na(.)))` – Agile Bean Oct 17 '18 at 08:57
  • In dplyr 1.0 `filter_all` and `any_vars` have both been superseded and `any_vars` has no replacement that I know of. The option advised in the [colwise vignette](https://dplyr.tidyverse.org/articles/colwise.html) is to define your own helper like `rowAny <- function(x) rowSums(x) > 0` so that the above solution becomes `dat %>% filter(rowAny(across(everything(), ~ !is.na(.x))))` – Callum Savage Jun 03 '20 at 22:16
  • Another alternative in dplyr 1.0 could be something like `dat %>% rowwise() %>% filter(sum(is.na(c_across(everything()))) != ncol(.)) %>% ungroup()`, though there's probably a more elegant way of achieving this. – Callum Savage Jun 03 '20 at 22:17
  • The colwise vignette (now?) mentions the following approach: `dat %>% filter(if_any(everything(), ~ !is.na(.x)))`, which "keeps the rows where the predicate is true for *at least one* selected column". (See answer from shosaco way below) – Thomas K Apr 28 '21 at 07:57
  • Thanks a lot!! I just do not understand why we can refer to the dot `.` here and do not need the `.x` in the anonymous function – Lenn Jan 05 '23 at 09:41

I would suggest using the wonderful janitor package here; it is very user-friendly:

janitor::remove_empty(dat, which = "rows")
mharinga

Benchmarking

@DavidArenburg suggested a number of alternatives. Here's a simple benchmark of them.

library(tidyverse)
library(microbenchmark)

n <- 100
dat <- tibble(a = rep(c(1, 2, NA), n), b = rep(c(1, 1, NA), n))

f1 <- function(dat) {
  na <- dat %>% 
    rowwise() %>% 
    do(tibble(na = !all(is.na(.)))) %>% 
    .$na
  filter(dat, na)
}

f2 <- function(dat) {
  dat %>% filter(rowSums(is.na(.)) != ncol(.))
}

f3 <- function(dat) {
  dat %>% filter(rowMeans(is.na(.)) < 1)
}

f4 <- function(dat) {
  dat %>% filter(Reduce(`+`, lapply(., is.na)) != ncol(.))
}

f5 <- function(dat) {
  dat %>%
    mutate(indx = row_number()) %>%
    gather(var, val, -indx) %>%
    group_by(indx) %>%
    filter(sum(is.na(val)) != n()) %>%
    spread(var, val)
}

# f1 is too slow to be included!
microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))

Using Reduce and lapply appears to be the fastest:

> microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
Unit: microseconds
 expr        min          lq       mean      median         uq        max neval
   f2    909.495    986.4680   2948.913   1154.4510   1434.725 131159.384   100
   f3    946.321   1036.2745   1908.857   1221.1615   1805.405   7604.069   100
   f4    706.647    809.2785   1318.694    960.0555   1089.099  13819.295   100
   f5 640392.269 664101.2895 692349.519 679580.6435 709054.821 901386.187   100

Using a larger data set (107,880 x 40):

dat <- diamonds
# Let every third row be NA
dat[seq(1, nrow(diamonds), 3), ]  <- NA
# Add some extra NA to first column so na.omit() wouldn't work
dat[seq(2, nrow(diamonds), 3), 1] <- NA
# Increase size
dat <- dat %>% 
  bind_rows(., .) %>%
  bind_cols(., .) %>%
  bind_cols(., .)
# Make names unique
names(dat) <- 1:ncol(dat)
microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))

f5 is too slow so it is also excluded. f4 seems to do relatively better than before.

> microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
Unit: milliseconds
 expr      min       lq      mean    median       uq      max neval
   f2 34.60212 42.09918 114.65140 143.56056 148.8913 181.4218   100
   f3 35.50890 44.94387 119.73744 144.75561 148.8678 254.5315   100
   f4 27.68628 31.80557  73.63191  35.36144 137.2445 152.4686   100
hejseb
  • I wonder if using `purrr` functions in `f4` would affect speed? `filter(reduce(map(., is.na), \`+\`) != ncol(.))` is arguably "tidier" – ClaytonJY Jun 28 '17 at 15:59
  • I get roughly identical performance on my machine on the same big-diamonds dataset between original `f4` and the purr-ified version I suggest above. – ClaytonJY Jun 28 '17 at 16:09

Starting with dplyr 1.0, the colwise vignette gives a similar case as an example:

filter(across(everything(), ~ !is.na(.x))) #Remove rows with *any* NA

We can see it uses the same implicit "& logic" that `filter` applies to multiple expressions. So the following minor adjustment selects the all-NA rows:

filter(across(everything(), ~ is.na(.x))) #Remove rows with *any* non-NA

But the question asks for the inverse set: Remove rows with all NA.

  1. We can do a simple setdiff using the previous, or
  2. we can use the fact that across returns a logical tibble and filter effectively does a row-wise all() (i.e. &).

E.g.:

rowAny = function(x) apply(x, 1, any)
anyVar = function(fcn) rowAny(across(everything(), fcn))  # make it readable
df %<>% filter(anyVar(~ !is.na(.x)))  # remove rows with *all* NA (%<>% comes from magrittr)

Or:

filterout = function(df, ...) setdiff(df, filter(df, ...))
df %<>% filterout(across(everything(), is.na)) #Remove rows with *all* NA

Or even combine the above two to express the first example more directly:

df %<>% filterout(anyVar(~ is.na(.x))) #Remove rows with *any* NA

In my opinion, the tidyverse filter function would benefit from a parameter describing the 'aggregation logic'. It could default to "all" and preserve behavior, or allow "any" so we wouldn't need to write anyVar-like helper functions.

jiggunjer
  • Thanks, the filterout function using setdiff works fine. Just be careful, as it will also drop any duplicate rows. To avoid this, we can use anti_join from dplyr: `filterout = function(df, ...) anti_join(df, filter(df, ...))` – radhikesh93 Aug 18 '20 at 22:06

The solution using dplyr 1.0 is simple and does not require helper functions; you just need to add a negation in the right place.

dat %>% filter(!across(everything(), is.na))
alex.franco

dplyr 1.0.4 introduced the if_any() and if_all() functions:

dat %>% filter(if_any(everything(), ~!is.na(.)))

or, more verbose:

dat %>% filter(if_any(everything(), purrr::negate(is.na)))

"Take dat and keep all rows where any entry is non-NA"

shosaco

Here's another solution that uses purrr::map_lgl() and tidyr::nest():

library(tidyverse)

dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))

any_not_na <- function(x) {
  !all(map_lgl(x, is.na))
}


dat_cleaned <- dat %>%
  rownames_to_column("ID") %>%
  group_by(ID) %>%
  nest() %>%
  filter(map_lgl(data, any_not_na)) %>%
  unnest() %>%
  select(-ID)

dat_cleaned
## # A tibble: 2 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1    1.    1.    2.
## 2    2.   NA    NA

I doubt this approach will be able to compete with the benchmarks in @hejseb's answer, but I think it does a pretty good job of showing how the nest %>% map %>% unnest pattern works, and users can step through it line by line to figure out what's going on.

Tiernan

You can use the function `complete.cases()` (from base R, not dplyr) inside `filter()`, using the dot (`.`) to refer to the data frame in the chain.

library(dplyr)
df = data.frame(
    x1 = c(1,2,3,NA),
    x2 = c(1,2,NA,5),
    x3 = c(NA,2,3,5)
)
df %>%
   filter(complete.cases(.))

  x1 x2 x3
1  2  2  2
rubengavidia0x

A neat solution that works in dplyr 1.0.1 is to use `rowwise()`:

dat %>%
  rowwise() %>%
  filter(!all(is.na(across(everything())))) %>%
  ungroup()

This is very similar to @Callum Savage's comment on the top post (which I missed on the first pass), but without the `sum()`.

TBlackmore

(tidyverse 1.3.1)

With `rowwise()`, `c_across()` collects the selected columns for the current row, so `!all(is.na(...))` keeps the rows that have at least one non-missing value:

data %>%
  rowwise() %>%
  filter(!all(is.na(c_across(where(is.numeric)))))

data %>%
  rowwise() %>%
  filter(!all(is.na(c_across(starts_with("***")))))