1

Seeking some expertise/guidance on creating a new column to indicate possible duplicates based on a few selected columns.

I have the following dataframe

ID Animal Age Delivery Cost Country
1 dog 5 Air 120 Nigeria
2 cat 3 Air 110 Kenya
3 fish 1 Air 20 Kenya
4 dog 5 Air 150 Nigeria
5 cat 3 Air 100 Kenya
6 dog 6 Air 180 Egypt
7 cat 3 Air 135 Kenya
8 turtle 10 Air 90 Nigeria
df = structure(list(ID = 1:8, Animals = c("dog", "cat", "fish", "dog", 
                                          "cat", "dog", "cat", "turtle"), Age = c(5L, 3L, 1L, 5L, 3L, 6L, 
                                           3L, 10L), Delivery = c("Air", "Air", "Air", "Air", "Air", "Air", 
                                           "Air", "Air"), Cost = c(120L, 110L, 20L, 150L, 100L, 180L, 135L, 
                                           90L), Country = c("Nigeria", "Kenya", "Kenya", "Nigeria", "Kenya", 
                                           "Egypt", "Kenya", "Nigeria")), class = "data.frame", row.names = c(NA, 
                                           -8L))

I would like to create a new column that highlights if the 3 columns - Animal, Age & Country repeats more than once to tag as duplicates.

The output would be the following

ID Animal Age Delivery Cost Country New Column
1 dog 5 Air 120 Nigeria Y
2 cat 3 Air 110 Kenya Y
3 fish 1 Air 20 Kenya N
4 dog 5 Air 150 Nigeria Y
5 cat 3 Air 100 Kenya Y
6 dog 6 Air 180 Egypt N
7 cat 3 Air 135 Kenya Y
8 turtle 10 Air 90 Nigeria N

Thanks in advance!

benson23
  • 16,369
  • 9
  • 19
  • 38
Luther_Proton
  • 348
  • 1
  • 7
  • I think this is essentially this r-FAQ - [Finding ALL duplicate rows](https://stackoverflow.com/questions/7854433/finding-all-duplicate-rows-including-elements-with-smaller-subscripts) - e.g.: `{function(x) duplicated(x) | duplicated(x, fromLast=TRUE)}(df[c("Animals","Age","Country")])` and the [tidyverse equivalent answer too](https://stackoverflow.com/a/56632683/496803) . – thelatemail Mar 24 '23 at 05:16

1 Answers1

1

You can group_by your target columns, then use n() to see if there's more than one record per group.

library(dplyr)

df %>% group_by(Animals, Age, Country) %>% mutate(dup = n() > 1) %>% ungroup()

# A tibble: 8 × 7
     ID Animals   Age Delivery  Cost Country dup  
  <int> <chr>   <int> <chr>    <int> <chr>   <lgl>
1     1 dog         5 Air        120 Nigeria TRUE 
2     2 cat         3 Air        110 Kenya   TRUE 
3     3 fish        1 Air         20 Kenya   FALSE
4     4 dog         5 Air        150 Nigeria TRUE 
5     5 cat         3 Air        100 Kenya   TRUE 
6     6 dog         6 Air        180 Egypt   FALSE
7     7 cat         3 Air        135 Kenya   TRUE 
8     8 turtle     10 Air         90 Nigeria FALSE
benson23
  • 16,369
  • 9
  • 19
  • 38