Is there any way to remove duplicates from a data frame based on certain conditions?

Question

I am looking to remove duplicates from my data frame with the distinct() function, but I don't want it to remove every duplicate: removal should depend on an additional condition.

For example, my data frame has the variables position, place, company, and source. I want duplicate rows to be removed only when position, place, and company are the same and the source variable is different. This is the call I am using:

omit <- distinct(final, position, place, company, .keep_all = TRUE)

I just want the duplicates to be removed when they match on those three variables while differing on the fourth. Is there some other way this could be done?

`duplicated()` will return booleans you can combine with your other conditions. — Alexis, Jun 21 '19 at 10:03
@StephenHenderson No, because I need it to follow the additional condition that it removes the duplicates only if the source variables do not match. — Tomas Vyšniauskas, Jun 21 '19 at 10:05
OK pardon my confusion - then that is just `distinct(final, .keep_all = TRUE)` ? — Stephen Henderson, Jun 21 '19 at 10:08
@StephenHenderson, no because then when `source` is different, it will keep them. — Sven, Jun 21 '19 at 11:19

score 1 · Answer 1 · answered Jun 21 '19 at 12:02

Using base R, you can flag the duplicated rows with duplicated() and then use that logical vector to subset your original data frame:

set.seed(123)
dd <- data.frame(matrix(sample(1:2, 10 * 3, TRUE), ncol = 3), fv = gl(2, 5, labels = letters[1:2]))
unique(dd) # 7 unique rows with all variables
#>    X1 X2 X3 fv
#> 1   1  2  2  a
#> 2   2  1  2  a
#> 4   2  2  2  a
#> 6   1  2  2  b
#> 7   2  1  2  b
#> 9   2  1  1  b
#> 10  1  2  1  b
col_dup <- names(dd)[1:3] # columns that define a duplicate
# unique(dd[, col_dup]) # the distinct combinations you expect to keep

ind_dup <- duplicated(dd[, col_dup]) # TRUE for rows that repeat an earlier combination
new_dd <- dd[!ind_dup, ] # keep only the first row of each combination
new_dd
#>    X1 X2 X3 fv
#> 1   1  2  2  a
#> 2   2  1  2  a
#> 4   2  2  2  a
#> 9   2  1  1  b
#> 10  1  2  1  b
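
Note that the code above drops every repeat of the three key columns, whether or not the fourth column differs. If the stricter reading of the question is intended (drop a repeat only when its source differs from the first row with the same position, place, and company), a minimal sketch could look like the following. The data in `final` is made up for illustration, and the helper names `key_cols`, `key`, `first_idx`, and `drop` are hypothetical; the tab separator in `paste()` assumes that no key value contains a tab.

```r
# Hypothetical example data using the column names from the question
final <- data.frame(
  position = c("analyst", "analyst", "analyst", "engineer", "engineer"),
  place    = c("Vilnius", "Vilnius", "Vilnius", "Kaunas", "Kaunas"),
  company  = c("A", "A", "A", "B", "B"),
  source   = c("site1", "site2", "site1", "site3", "site3"),
  stringsAsFactors = FALSE
)

key_cols <- c("position", "place", "company")

# Build one string key per row (assumes no value contains a tab)
key <- do.call(paste, c(final[key_cols], sep = "\t"))

# For every row, the index of the first row that has the same key
first_idx <- match(key, key)

# Drop a row only if its source differs from that first row's source;
# first occurrences compare against themselves and are always kept
drop <- final$source != final$source[first_idx]

omit <- final[!drop, ]
omit
```

With this sample data, row 2 is dropped (same key as row 1, different source), while rows 3 and 5 stay because their source matches the first row with the same key. If exact repeats should go as well, `duplicated(final[, key_cols])` on its own, as in the answer above, is enough.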
