46

I'd like to remove rows corresponding to a particular combination of variables from my data frame.

Here's a dummy data :

father<- c(1, 1, 1, 1, 1)
mother<- c(1, 1, 1, NA, NA) 
children <- c(NA, NA, 2, 5, 2) 
cousins   <- c(NA, 5, 1, 1, 4) 


dataset <- data.frame(father, mother, children, cousins)  
dataset


father  mother  children cousins
1      1       NA      NA
1      1       NA       5
1      1        2       1
1     NA        5       1
1     NA        2       4

I want to filter this row :

  father  mother  children cousins
    1      1       NA      NA

I can do it with :

test <- dataset %>% 
filter(father==1 & mother==1) %>%
filter (is.na(children)) %>%
filter (is.na(cousins))
test  

My question : I have many columns like grand father, uncle1, uncle2, uncle3 and I want to avoid something like that:

  filter (is.na(children)) %>%
  filter (is.na(cousins)) %>%
  filter (is.na(uncle1)) %>%
  filter (is.na(uncle2)) %>%
  filter (is.na(uncle3)) 
  and so on...

How can I use dplyr to say filter all the column with na (except father==1 & mother==1)

Joe
  • 8,073
  • 1
  • 52
  • 58
Wilcar
  • 2,349
  • 2
  • 21
  • 48
  • 2
    IMO you should convert your data to long format (http://stackoverflow.com/questions/2185252/reshaping-data-frame-from-wide-to-long-format) – talat May 12 '17 at 13:33
  • Possible duplicate of http://stackoverflow.com/questions/7381455/filtering-a-data-frame-by-values-in-a-column – akrun May 12 '17 at 14:43

6 Answers6

80

A possible dplyr(0.5.0.9004 <= version < 1.0) solution is:

# > packageVersion('dplyr')
# [1] ‘0.5.0.9004’

dataset %>%
    filter(!is.na(father), !is.na(mother)) %>%
    filter_at(vars(-father, -mother), all_vars(is.na(.)))

Explanation:

  • vars(-father, -mother): select all columns except father and mother.
  • all_vars(is.na(.)): keep rows where is.na is TRUE for all the selected columns.

note: any_vars should be used instead of all_vars if rows where is.na is TRUE for any column are to be kept.


Update (2020-11-28)

As the _at functions and vars have been superseded by the use of across since dplyr 1.0, the following way (or similar) is recommended now:

dataset %>%
    filter(across(c(father, mother), ~ !is.na(.x))) %>%
    filter(across(c(-father, -mother), is.na))

See more example of across and how to rewrite previous code with the new approach here: Colomn-wise operatons or type vignette("colwise") in R after installing the latest version of dplyr.

mt1022
  • 16,834
  • 5
  • 48
  • 71
32

dplyr >= 1.0.4

If you're using dplyr version >= 1.0.4 you really should use if_any or if_all, which specifically combines the results of the predicate function into a single logical vector making it very useful in filter. The syntax is identical to across, but these verbs were added to help fill this need: if_any/if_all.

library(dplyr)

dataset %>% 
  filter(if_all(-c(father, mother), ~ is.na(.)), if_all(c(father, mother), ~ !is.na(.)))

Here I have written out the variable names, but you can use any tidy selection helper to specify variables (e.g., column ranges by name or location, regular expression matching, substring matching, starts with/ends with, etc.).

Output

  father mother children cousins
1      1      1       NA      NA
LMc
  • 12,577
  • 3
  • 31
  • 43
  • 1
    Regarding the anonymous function, can we replace `~ is.na(.)` by `~ is.na(.x)` here ? – Julien Jul 11 '22 at 12:29
  • @Julien Yes, that will work. Please see this [SO question](https://stackoverflow.com/questions/56532119/dplyr-piping-data-difference-between-and-x) regarding this topic. – LMc Jul 11 '22 at 15:20
  • 1
    This new pair of helper functions are more flexible and should be preferred! – mt1022 Dec 13 '22 at 15:01
7

A dplyr solution:

test <- dataset %>% 
  filter(father == 1 & mother == 1 & rowSums(is.na(.[,3:4])) == 2)

Where '2' is the number of columns that should be NA.

This gives:

> test
  father mother children cousins
1      1      1       NA      NA

You can apply this logic in base R as well:

dataset[dataset$father == 1 & dataset$mother == 1 & rowSums(is.na(dataset[,3:4])) == 2,]
Jaap
  • 81,064
  • 34
  • 182
  • 193
7

None of the answers seems to be an adaptable solution. I think the intention is not to list all the variables and values to filter the data.

One easy way to achieve this is through merging. If you have all the conditions in df_filter then you can do this:

df_results = df_filter %>% left_join(df_all)
Feng Jiang
  • 1,776
  • 19
  • 25
  • it is unclear what "all the conditions in df_filter " means. I would suspect all conditions in dplyr code, but your code example suggests otherwise. Please clarify – Agile Bean Jan 08 '21 at 08:59
  • It means all cells meet the condition. Like the example given in the question under "I want to filter this row :". – Feng Jiang Jan 16 '21 at 00:02
4

Here is a base R method using two Reduce functions and [ to subset.

keepers <- Reduce(function(x, y) x == 1 & y == 1, dataset[, 1:2]) &
           Reduce(function(x, y) is.na(x) & is.na(y), dataset[, 3:4])
keepers
[1]  TRUE FALSE FALSE FALSE FALSE

Each Reduce consecutively takes the variables provided and performs a logical check. The two results are connected with an &. The second argument to the Reduce functions can be adjusted to include whatever variables in the data.frame that you want.

Then use the logical vector to subset

dataset[keepers,]
  father mother children cousins
1      1      1       NA      NA
lmo
  • 37,904
  • 9
  • 56
  • 69
0

This answer builds on @Feng Jiangs answer using the dplyr::left_joint() operation, and is more like a reprex. In addition, it ensures the proper order of columns is restored in case the order of variables in df_filter differs from the order of the variables in the original dataset. Also, the dataset was expanded for a duplicate combination to show these are part of the filtered output (df_out).

library(dplyr)

father<- c(1, 1, 1, 1, 1,1)
mother<- c(1, 1, 1, NA, NA,1) 
children <- c(NA, NA, 2, 5, 2,NA) 
cousins   <- c(NA, 5, 1, 1, 4,NA)

dataset <- data.frame(father, mother, children, cousins) 
df_filter <- data.frame( father = 1, mother = 1, children = NA, cousins = NA)
                         
test  <- df_filter  %>% 
  left_join(dataset) %>% 
  relocate(colnames(dataset))
André
  • 31
  • 6