1

I am trying to compare two columns in a dataframe to find rows where the two columns are not equal.

I would do:

df %>% filter(column1 != column2)

This will give me cases where values exist in both columns and are not equal (e.g. column1 = 5, column2 = 6)

However it will not give me cases where one of the values is NA (e.g. column1 = NA, column2 = 7)

How can I include the latter case into the filter function?

Thanks

MLE
  • 185
  • 8
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Oct 27 '21 at 03:46

3 Answers3

3

Or use xor:

df %>% filter(a != b | xor(is.na(a), is.na(b)))

Or as @thelatemail mentioned, you could use Base R:

df[which(df$a != df$b | xor(is.na(df$a), is.na(df$b))),]

Or as @runr mentioned, you could try subset in Base R:

subset(df, a != b | xor(is.na(a), is.na(b)))
U13-Forward
  • 69,221
  • 14
  • 89
  • 114
  • I think this code will include (NA, NA), which I do not want in the final df since they are 'equal' for my purposes – MLE Oct 27 '21 at 03:42
  • @MLE Check it out now, `xor` – U13-Forward Oct 27 '21 at 03:46
  • @MLE `xor` is `or`, but not `and`. So it selects if one out of the two is `NA`, but won't select if both of them are – U13-Forward Oct 27 '21 at 03:47
  • 1
    Nice logic. Interesting that this selection only works in `dplyr::filter` because `NA` selections are dropped. Got to add a `which()` for `[]` selection - `df[which(df$a != df$b | xor(is.na(df$a), is.na(df$b))),]` – thelatemail Oct 27 '21 at 04:45
  • @thelatemail Yeap! Thanks for a solution in Base R, added that! – U13-Forward Oct 27 '21 at 05:12
  • 1
    Also just ``subset(df, a != b | xor(is.na(a), is.na(b)))`` instead of ``with()`` does the job with base R – runr Oct 27 '21 at 06:01
2

You can include them with an OR (|) condition -

library(dplyr)

df <- data.frame(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 8))

df %>% filter(a != b | is.na(a) | is.na(b))

#   a  b
#1  1 NA
#2 NA  3
#3  5  8

Another option would be to change NA values to string "NA" and then only using a != b should work.

df %>% 
  mutate(across(.fns = ~replace(., is.na(.), 'NA'))) %>%
  filter(a != b) %>%
  type.convert(as.is = TRUE) 
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • With this, wouldn't (NA, NA) be included? (which I don't want since they are 'equal' for my purposes) Given that NA == NA is NA not TRUE. Just looking for a clean solution – MLE Oct 27 '21 at 03:39
  • 1
    @MLE See the updated answer. An example dataframe with a sample output would have helped to better understand the problem. – Ronak Shah Oct 27 '21 at 03:45
0

We can use if_any

library(dplyr)
df %>% 
   filter(a != b | if_any(everything(), is.na))
   a  b
1  1 NA
2 NA  3
3  5  8

data

df <- structure(list(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 8)),
 class = "data.frame", row.names = c(NA, 
-5L))
akrun
  • 874,273
  • 37
  • 540
  • 662