
There are several ways to identify and manipulate individual cells with missing data in R, e.g., with `complete.cases` or even `rowSums` (illustrated below after the example data).

However, I've not been able to find---or figure out myself---an expedient way to select rows that have missing data within a subsetted range of columns.

For example, in the data frame `df`:

df <- data.frame(D1 = c('A', 'B', 'C', 'D'),
                 D2 = c(NA, 0, 1, 1),
                 V1 = c(11, NA, 33, NA),
                 V2 = c(111, 222, NA, NA)
                 )
df

#   D1 D2 V1  V2
# 1  A NA 11 111
# 2  B  0 NA 222
# 3  C  1 33  NA
# 4  D  1 NA  NA
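
For reference, here is roughly how the whole-data-frame tools mentioned above behave on df; they flag rows with any missing cell, not a chosen range of columns:

complete.cases(df)   # FALSE FALSE FALSE FALSE (every row has at least one NA)
rowSums(is.na(df))   # 1 1 1 2 (count of NAs per row, across all columns)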

I would like to select all rows that have missing data in both columns V1 and V2, thus selecting row D but not rows B or C (or A).

I have a larger range of columns than given in that toy example, so combining each column's test with, e.g., `&` could make for a long command.

N.B., a similar SO question addresses selecting rows where none are NAs.

wes
2 Answers


You can try this:

library(dplyr)
df %>% filter(is.na(V1) & is.na(V2))

OUTPUT

  D1 D2 V1 V2
1  D  1 NA NA
Isaac
  • Thank you (+1). Any ideas on selecting a range of columns as opposed to each column via `&`? – wes Sep 04 '22 at 07:39

You can use `dplyr::if_all`. The columns can be selected very flexibly with tidyselect, for instance with `:`, `c()`, `starts_with()`, etc.

library(dplyr)
df %>% 
  filter(if_all(V1:V2, is.na))

#  D1 D2 V1 V2
#1  D  1 NA NA

These also work (showing the flexibility of tidyselect):

filter(df, if_all(3:4, is.na))
filter(df, if_all(starts_with("V"), is.na))
filter(df, if_all(c(V1, V2), is.na))
filter(df, if_all((last_col()-1):last_col(), is.na))
filter(df, if_all(num_range("V", 1:2), is.na))
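
For comparison, a rough base-R equivalent of the same range-based selection, building on the rowSums idea from the question (the column set shown is just the toy example's):

cols <- c("V1", "V2")                            # or 3:4, or grep("^V", names(df), value = TRUE)
df[rowSums(is.na(df[cols])) == length(cols), ]   # rows where every selected column is NA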
Maël