Remove duplicates in one column based on another column

Question

I'm looking for a nicer way to do this in R. I do have one possibility but it seems like there should be a smart/more readable way.

I want to delete duplicates in one/more column only if a condition is met in another column (or columns).

In my simplified example I want to delete duplicates in column X only if column Y is NA, but keep NA's in Y without a duplicated X.

testDF<- data.frame(X= c(1:4,4:8,8:12), Y = 1:14)
testDF$Y[c(4,6,10)]<- NA

My current solution is:

testDF[!(testDF$X %in% testDF$X[which(duplicated(testDF$X))] & is.na(testDF$Y)),]

or

library(dplyr)

testDF %>% 
    dplyr::filter(!(testDF$X%in% testDF$X[which(duplicated(testDF$X))] & is.na(testDF$Y)))

which both appear messy and confusing, and in a real application where I am going to be looking at more than two columns could get unworkable.

I was hoping for something more along the lines of:

 testDF %>% dplyr::filter(!(duplicated(X) & is.na(Y)))

but it duplicated() only identifies the second instance of a duplication so if Y's NA is in line with the first of the duplicated X values then it will not be filtered out.

Preferably looking for a base or tidyverse solution as none of the rest of the script is using data.table

You current dplyr solution can be simplified to: `testDF %>% dplyr::filter(!(X %in% X[duplicated(X)] & is.na(Y)))`. — mt1022, Jun 22 '18 at 01:28

score 2 · Answer 1 · answered Jun 22 '18 at 01:35

2

We can arrange the columns, group by X, and then slice the first row. By doing this, we can get the non-NA row in Y if that row exists.

library(dplyr)

testDF %>%
  arrange(X, Y) %>%
  group_by(X) %>%
  slice(1) %>%
  ungroup()
# # A tibble: 12 x 2
#        X     Y
#    <int> <int>
#  1     1     1
#  2     2     2
#  3     3     3
#  4     4     5
#  5     5    NA
#  6     6     7
#  7     7     8
#  8     8     9
#  9     9    11
# 10    10    12
# 11    11    13
# 12    12    14

answered Jun 22 '18 at 01:35

www

38,575
12
48
84

1

I really like this solution for this exact example, but I think that to apply to more columns when the my criteria may not be excluding `NA`s it might be more difficult to implement. +1 because for simple versions it will work well – Sarah Jun 22 '18 at 02:48
1

Hmm, just realised that if I have multiple non-na duplicates this won't work though which would need to be included – Sarah Jun 22 '18 at 02:52
@www, I am looking for the same solution and wonder if you've found a solid answer to this question where I have multiple rows with `NAs` or/and `empty cells`? – Alex May 27 '21 at 16:27
@Alex You can post a new question with reproducible examples if your question is different from this one. – www May 28 '21 at 02:08

Melissa Key · Accepted Answer · 2018-06-22T01:46:22.660

2

You can also just apply duplicated from both directions:

testDF %>%
  filter(!is.na(Y) | (!duplicated(X) & !duplicated(X, fromLast = TRUE) ))

(highly influenced by this: Find duplicated elements with dplyr - I'll let others decide if this is close enough to be a duplicate)

To make your code even more readable, you can even put this in a function (perhaps with a better function name than mine):

all_duplicates <- function(x) {
  duplicated(x) | duplicated(x, fromLast = TRUE)
}
testDF %>%
  filter(!is.na(Y) | !all_duplicates(X) )

edited Jun 22 '18 at 01:46

answered Jun 22 '18 at 01:40

Melissa Key

4,476
12
21

Thanks, I did a solutions using `duplicated(x) | duplicated(x, fromLast = TRUE)` shortly after asking the question but thought I'd leave it up to see what other approaches might be possible. This is very simply though and with it in its own function would make applying across other columns and criteria easy and readable – Sarah Jun 22 '18 at 02:59

score 0 · Answer 3 · edited Jun 22 '18 at 02:57

0

I had a similar idea like what @www suggested, arrange X and Y first, and then filter instead of slice.

The first filtering condition would be row_number()==1 , as if row_number == 1 it means X is distinct.
The second filtering condition would be !is.na(Y) , which would only keep any X whose Y is not NA.

testDF %>% arrange(X,Y) %>%  
      group_by(X) %>% filter(row_number()==1 | !is.na(Y)) %>% ungroup()

edited Jun 22 '18 at 02:57

Mihai Chelaru

7,614
14
45
51

answered Jun 22 '18 at 02:37

DDZR

1
1

Thanks, I like this answer but may get confusing if I wanted to apply to more columns with individual conditions – Sarah Jun 22 '18 at 03:08
@Sarah, I am looking for the same solution and wonder if you've found a solid answer to your question in case you have multiple rows with `NAs` or `empty cells`? – Alex May 27 '21 at 16:20
@Mihai Chelaru, can you please advise, I just commented on your answer when there are multiple rows with `NAs` or `empty cells`? – Alex May 27 '21 at 16:22
@Alex I am the last editor of this answer, not the original author. DDZR is the author, so perhaps they can provide an update, but please be patient as their profile says they have not been online since May 23rd. I don't currently program in R so I'm afraid I can't be of any use here. – Mihai Chelaru May 27 '21 at 17:32
@MihaiChelaru, thanks for the update anyway. Much appreciated. – Alex May 27 '21 at 17:51
Hi @Alex I honestly cannot remember why I even asked this question, let alone what I ended up doing about it, sorry I can't be more helpful – Sarah May 28 '21 at 03:43

Remove duplicates in one column based on another column

3 Answers3

Linked