I'm looking for a nicer way to do this in R. I do have one possibility but it seems like there should be a smart/more readable way.
I want to delete duplicates in one/more column only if a condition is met in another column (or columns).
In my simplified example I want to delete duplicates in column X only if column Y is NA, but keep NA's in Y that do not sit on a duplicated X.
testDF <- data.frame(X = c(1:4, 4:8, 8:12), Y = 1:14)
testDF$Y[c(4, 6, 10)] <- NA
My current solution is:
testDF[!(testDF$X %in% testDF$X[which(duplicated(testDF$X))] & is.na(testDF$Y)),]
or
library(dplyr)
testDF %>%
  dplyr::filter(!(testDF$X %in% testDF$X[which(duplicated(testDF$X))] & is.na(testDF$Y)))
Both appear messy and confusing, and in a real application where I will be looking at more than two columns this could become unworkable.
I was hoping for something more along the lines of:
testDF %>% dplyr::filter(!(duplicated(X) & is.na(Y)))
but duplicated() only flags the second and later instances of a duplicate, so if Y's NA falls on the first of the duplicated X values, that row will not be filtered out.
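To illustrate the pitfall: duplicated() marks only later occurrences, so marking every member of a duplicated group needs a second pass from the other end. A small sketch of one possible base-R workaround, using duplicated()'s fromLast = TRUE argument (offered as a sketch, not the only fix):

```r
# Reproduce the example data
testDF <- data.frame(X = c(1:4, 4:8, 8:12), Y = 1:14)
testDF$Y[c(4, 6, 10)] <- NA

# duplicated() flags only the later occurrences of each value
# (here the second 4 and the second 8, at positions 5 and 10)
duplicated(testDF$X)

# Combining both directions flags every member of a duplicated group
dupX <- duplicated(testDF$X) | duplicated(testDF$X, fromLast = TRUE)

# Drop rows whose X is duplicated anywhere and whose Y is NA;
# the NA at row 6 (X = 5, not duplicated) survives
testDF[!(dupX & is.na(testDF$Y)), ]
```

This removes rows 4 and 10 (duplicated X with NA in Y) while keeping row 6, matching the behaviour of the longer expressions above.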
Preferably looking for a base or tidyverse solution, as none of the rest of the script uses data.table.