I want to remove duplicate rows from my data frame with the distinct() function, but only when an additional condition holds.

For example, my data frame has the variables position, place, company, and source. I want duplicate rows to be removed only when position, place, and company are the same and the source variable is different. This is the function I am using:

library(dplyr)
omit <- distinct(final, position, place, company, .keep_all = TRUE)

In other words, duplicates should be removed only when the rows match on those three variables but differ on the fourth. Is there some other way this could be done?

SDJ

1 Answer

Using base R, you can index the duplicates and then subset your original data frame:

set.seed(123) # the output below matches R < 3.6.0; later versions sample differently by default
dd <- data.frame(matrix(sample(1:2, 10 * 3, TRUE), ncol = 3), fv = gl(2, 5, labels = letters[1:2]))
unique(dd) # 7 unique rows with all variables
#>    X1 X2 X3 fv
#> 1   1  2  2  a
#> 2   2  1  2  a
#> 4   2  2  2  a
#> 6   1  2  2  b
#> 7   2  1  2  b
#> 9   2  1  1  b
#> 10  1  2  1  b
col_dup <- names(dd)[1:3] # columns used to detect duplicates
# unique(dd[, col_dup]) # the key combinations expected in the final result

ind_dup <- duplicated(dd[, col_dup]) # TRUE for rows whose key columns repeat an earlier row
new_dd <- dd[!ind_dup, ] # keep only the first row of each key combination
new_dd
#>    X1 X2 X3 fv
#> 1   1  2  2  a
#> 2   2  1  2  a
#> 4   2  2  2  a
#> 9   2  1  1  b
#> 10  1  2  1  b
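
Note that this keeps the first row of each key combination, which is also what distinct(final, position, place, company, .keep_all = TRUE) does. If the intent is instead to keep rows that are exact copies in all four columns and drop only those that share the three keys but have a different source, the two duplicated() indexes can be combined. The sketch below is one possible reading of the question, not a definitive answer: the contents of `final` are made up for illustration, and only the column names come from the question.

```r
# Hypothetical data; only the column names are taken from the question
final <- data.frame(
  position = c("analyst", "analyst", "analyst", "engineer", "engineer"),
  place    = c("Paris",   "Paris",   "Paris",   "Lyon",     "Lyon"),
  company  = c("Acme",    "Acme",    "Acme",    "Globex",   "Globex"),
  source   = c("siteA",   "siteB",   "siteA",   "siteA",    "siteA"),
  stringsAsFactors = FALSE
)
keys <- c("position", "place", "company")

# Reading 1 (same as the approach above): keep the first row per key combination
first_per_key <- final[!duplicated(final[, keys]), ]

# Reading 2 (assumed intent): drop a row only when an earlier row has the same
# keys and no earlier row is identical in all four columns
key_dup  <- duplicated(final[, keys]) # keys repeat an earlier row
full_dup <- duplicated(final)         # whole row repeats an earlier row
kept <- final[!(key_dup & !full_dup), ]
```

With this toy data, first_per_key contains rows 1 and 4, while kept contains rows 1, 3, 4 and 5: row 2 is dropped because it matches row 1 on the three keys but comes from a different source, and rows 3 and 5 stay because they are exact copies of earlier rows.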
cbo