New column to identify duplicate rows in R

Question

Seeking some expertise/guidance on creating a new column to indicate possible duplicates based on a few selected columns.

I have the following dataframe

ID	Animal	Age	Delivery	Cost	Country
1	dog	5	Air	120	Nigeria
2	cat	3	Air	110	Kenya
3	fish	1	Air	20	Kenya
4	dog	5	Air	150	Nigeria
5	cat	3	Air	100	Kenya
6	dog	6	Air	180	Egypt
7	cat	3	Air	135	Kenya
8	turtle	10	Air	90	Nigeria

df = structure(list(ID = 1:8, Animals = c("dog", "cat", "fish", "dog", 
                                          "cat", "dog", "cat", "turtle"), Age = c(5L, 3L, 1L, 5L, 3L, 6L, 
                                           3L, 10L), Delivery = c("Air", "Air", "Air", "Air", "Air", "Air", 
                                           "Air", "Air"), Cost = c(120L, 110L, 20L, 150L, 100L, 180L, 135L, 
                                           90L), Country = c("Nigeria", "Kenya", "Kenya", "Nigeria", "Kenya", 
                                           "Egypt", "Kenya", "Nigeria")), class = "data.frame", row.names = c(NA, 
                                           -8L))

I would like to create a new column that highlights if the 3 columns - Animal, Age & Country repeats more than once to tag as duplicates.

The output would be the following

ID	Animal	Age	Delivery	Cost	Country	New Column
1	dog	5	Air	120	Nigeria	Y
2	cat	3	Air	110	Kenya	Y
3	fish	1	Air	20	Kenya	N
4	dog	5	Air	150	Nigeria	Y
5	cat	3	Air	100	Kenya	Y
6	dog	6	Air	180	Egypt	N
7	cat	3	Air	135	Kenya	Y
8	turtle	10	Air	90	Nigeria	N

Thanks in advance!

I think this is essentially this r-FAQ - [Finding ALL duplicate rows](https://stackoverflow.com/questions/7854433/finding-all-duplicate-rows-including-elements-with-smaller-subscripts) - e.g.: `{function(x) duplicated(x) | duplicated(x, fromLast=TRUE)}(df[c("Animals","Age","Country")])` and the [tidyverse equivalent answer too](https://stackoverflow.com/a/56632683/496803) . — thelatemail, Mar 24 '23 at 05:16

score 1 · Accepted Answer · answered Mar 24 '23 at 04:55

You can group_by your target columns, then use n() to see if there's more than one record per group.

library(dplyr)

df %>% group_by(Animals, Age, Country) %>% mutate(dup = n() > 1) %>% ungroup()

# A tibble: 8 × 7
     ID Animals   Age Delivery  Cost Country dup  
  <int> <chr>   <int> <chr>    <int> <chr>   <lgl>
1     1 dog         5 Air        120 Nigeria TRUE 
2     2 cat         3 Air        110 Kenya   TRUE 
3     3 fish        1 Air         20 Kenya   FALSE
4     4 dog         5 Air        150 Nigeria TRUE 
5     5 cat         3 Air        100 Kenya   TRUE 
6     6 dog         6 Air        180 Egypt   FALSE
7     7 cat         3 Air        135 Kenya   TRUE 
8     8 turtle     10 Air         90 Nigeria FALSE

New column to identify duplicate rows in R

1 Answers1