4

I want to remove rows from a data.frame where all cols are NA. But I'd like to keep rows that have some values NA.

I know how to do this with base R but I'm trying to figure out how to make it work with tidyverse. I'm trying the across operator.

library(tidyverse)

teste <- data.frame(a = c(1,NA,3, NA), b = c(NA, NA, 3, 4), c = c(1, NA, 3, 4))

teste
#>    a  b  c
#> 1  1 NA  1
#> 2 NA NA NA
#> 3  3  3  3
#> 4 NA  4  4
# I whant to remove rows where all values are NA
# that is, remove only line 2

# here I can get the lines with all values NA
teste %>%
  filter(across(a:c, is.na))
#>    a  b  c
#> 1 NA NA NA

# If I negate the filter, it does not work
# the last line (NA, 4, 4) is missing
teste %>%
  filter(!across(a:c, is.na))
#>   a  b c
#> 1 1 NA 1
#> 2 3  3 3

# This is what I'm expecting
# a  b  c
# 1 NA  1
# 3  3  3
# NA  4  4

# Using base I can do this with
teste[apply(teste, 1, function(x) sum(is.na(x))) < 3,]
#>    a  b c
#> 1  1 NA 1
#> 3  3  3 3
#> 4 NA  4 4

How can I do this using tidyverse?

Created on 2020-08-18 by the reprex package (v0.3.0)

Daniel
  • 462
  • 3
  • 13
  • Does this answer your question? [Remove rows with all or some NAs (missing values) in data.frame](https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame) – Duck Aug 19 '20 at 00:28
  • It did help find a way to do it using Base R. But I had a hard time finding an answer using filter. Maybe I did not grasp all the information in those answers. Anyway, I already accepted an answer here, but ended up using R base in my code... – Daniel Aug 19 '20 at 01:39

3 Answers3

3

We can use base R

teste[rowSums(!is.na(teste)) >0,]
#   a  b c
#1  1 NA 1
#3  3  3 3
#4 NA  4 4

Or using apply and any

teste[apply(!is.na(teste), 1, any),]

which can be also used within filter

teste %>%
      filter(rowSums(!is.na(.)) >0)

Or using c_across from dplyr, we can directly remove the rows with all NA

library(dplyr)
teste %>% 
    rowwise %>% 
    filter(!all(is.na(c_across(everything()))))
# A tibble: 3 x 3
# Rowwise: 
#      a     b     c
#  <dbl> <dbl> <dbl>
#1     1    NA     1
#2     3     3     3
#3    NA     4     4

NOTE: filter_all is getting deprecated

akrun
  • 874,273
  • 37
  • 540
  • 662
  • I saw a similar question on Twitter and gave something similar to the `rowSums` solution. However, I've been wondering how fast this solution would be for very large datasets since it seems too "expensive" to check with `is.na` on the entire dataset first. – NelsonGon Aug 19 '20 at 02:13
  • 1
    @NelsonGon If memory is not a constraint, then it should be fast. Otherwise, for really big dataset, it depends on the number of columns, etc. It may be slightly faster with `tidyft` if the input dataset is fst – akrun Aug 19 '20 at 19:01
2

Previously in dplyr, you could use filter_all (for all columns)/filter_at (for specific columns) which had any_vars :

library(dplyr)

teste %>% filter_all(any_vars(!is.na(.)))

However, across does not have direct replacement of any_vars so you can use this with Reduce :

teste %>% filter(Reduce(`|`, across(.fns = Negate(is.na))))
#   a  b c
#1  1 NA 1
#2  3  3 3
#3 NA  4 4
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
2

Using data.table, you can produce the same outcome.

teste2 <- teste[-which(is.na(teste$a)&is.na(teste$b)&is.na(teste$c)),]