
I need to remove duplicated rows where all values are duplicates apart from two columns. How do I go about this?

The posted solutions are along the lines of this:

df[!duplicated(df[ , c("x","y")]),]

Where rows are flagged as duplicates based on whether the same values appear again in the stated columns, i.e. x and y in the example above. Therefore the answer suggested by the person who closed my question is not helpful (R dataframe: drop duplicates based on certain columns [duplicate]).

Some of my data is duplicated in all 40 columns except two. I would therefore like to ignore those two columns when testing for duplicates. I can remove the duplicated rows if I leave one column out of consideration, as in the example below, so the second suggested duplicate is also not applicable (Select all rows which are duplicates except for one column).

df[!duplicated(df[, -5]), ] 

This works by column number, but I can't get it to work for two columns.

Here is an example data frame where rows 4, 10 and 16 (see x5) need to be removed, since they are duplicated in every column except x5 and x6.

df1 <- data.frame(x1 = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3", "3"),
                  x2 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"),
                  x3 = c("A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "B2", "B2", "B3", "B3", "C1", "C1", "C2", "C2", "C3", "C3"),
                  x4 = c("35", "43", "33", "33", "63", "24", "14", "25", "77", "77", "94", "51", "34", "36", "55", "55", "73", "72"),
                  x5 = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18"),
                  x6 = c("XA", "XB", "XC", "XD", "XE", "XF", "XG", "XH", "XI", "XJ", "XK", "XL", "XM", "XN", "XO", "XP", "XQ", "XR"))
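For what it's worth, the single-column negative-index idiom extends to two columns by passing a vector of positions; a sketch, assuming x5 and x6 occupy columns 5 and 6 as in df1 above:

```r
# Drop rows that duplicate an earlier row in every column except x5 and x6
# (columns 5 and 6 of df1)
df1[!duplicated(df1[, -c(5, 6)]), ]
```

This keeps the first row of each duplicate group, so rows 4, 10 and 16 are dropped.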

This is my desired outcome.

df2 <- data.frame(x1 = c("1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3"),
                  x2 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C"),
                  x3 = c("A1", "A1", "A2", "A3", "A3", "B1", "B1", "B2", "B3", "B3", "C1", "C1", "C2", "C3", "C3"),
                  x4 = c("35", "43", "33", "63", "24", "14", "25", "77", "94", "51", "34", "36", "55", "73", "72"),
                  x5 = c("1", "2", "3", "5", "6", "7", "8", "9", "11", "12", "13", "14", "15", "17", "18"),
                  x6 = c("XA", "XB", "XC", "XE", "XF", "XG", "XH", "XI", "XK", "XL", "XM", "XN", "XO", "XQ", "XR"))
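A tidyverse sketch of the same idea (assuming dplyr >= 1.1.0, which introduced pick()): deduplicate on everything except the two ignored columns while keeping all columns, which reproduces df2 here up to row names:

```r
library(dplyr)

# Treat rows as duplicates when they match on every column except x5 and x6
df1 |>
  distinct(pick(-c(x5, x6)), .keep_all = TRUE)
```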
Mark Rotteveel
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Aug 09 '23 at 14:48
  • Yes, I have added some example data now, thanks MrFlick. Ian, the problem is that I have two columns that I wish to ignore. I am not sure how to implement the previously posted solutions when there is more than one column to ignore. – Joel Blomqvist Aug 09 '23 at 15:37
  • Maybe you want `df1 %>% group_by(pick(x1:x4)) %>% summarize(across(x5:x6 , first))` You need to supply a rule to collapse each of the columns you are ignoring for duplicates. Here it takes the first values for the duplicated groups. But `df1[!duplicated(df1[ , c("x1","x2", "x3","x4")]),]` and `df1[!duplicated(df1[, -5:-6]), ] ` also seems to return the same thing. So it's unclear why that wouldn't work. – MrFlick Aug 11 '23 at 20:35

1 Answer


You may try the code below, assuming that Identifier and Suffix are the variables on which you want to remove duplicates. If you want only one variable, i.e. Identifier, keep it and remove Suffix.

library(dplyr)

unique_df <- df_split |>
  distinct(Amount, Identifier, Suffix, .keep_all = TRUE)
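Translated onto the question's df1 (the answer's df_split, Amount, Identifier and Suffix stand in for the asker's own names), the same distinct() call would list the columns that must match, i.e. every column except the two ignored ones:

```r
library(dplyr)

# Keep the first row of each group that matches on x1..x4,
# ignoring x5 and x6 when testing for duplicates
df1 |>
  distinct(x1, x2, x3, x4, .keep_all = TRUE)
```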
jkatam
  • OP requested to select by the two ignored columns, not the 38 de-duped ones. Maybe `df |> distinct(pick(-c(x, y)), .keep_all = TRUE)`? – Jon Spring Aug 09 '23 at 15:02