R: finding duplicated rows according to multiple columns but not removing the rows

Question

This is probably asked elsewhere , however I can't get to it with the different search phrasing I've tried... My question is about finding duplicates but not removing them. I've found a few examples on how to remove rows based on multiple columns (e.g. link1, link2) but I want to keep them in the dataset, thus the number and order of rows is not altered.

An example of my data:

shop_code <- c(1, 1, 1, 5, 1, 1, 1, 2, 2, 2, 1, 2, 3, 1)
brand <- c("Renault", "Mazda", "Peugeot", "Renault", "Peugeot", "Volkswagen", "Renault", "Ford", "Renault", "Peugeot", "Renault", "Mazda", "Ford", "Ford")
date <- c("2023-01-04", "2023-01-04", "2023-01-13", "2023-01-04", "2023-01-13", "2020-08-23", "2023-01-18", "2022-04-01", "2021-09-20", "2001-05-02", "2023-01-04", "2023-01-05", "2011-03-30", "2011-03-30")
df <- data.frame(shop_code, brand, date)
revision <- c("1", "", "", "5", "", "", "", "", "", "", "1", "", "3", "1")
df$revision <- revision

In table format:

shop_code	brand	date	revision
1	Renault	2023-01-04	1
1	Mazda	2023-01-04
1	Peugeot	2023-01-13
5	Renault	2023-01-04	5
1	Peugeot	2023-01-13
1	Volkswagen	2020-08-23
1	Renault	2023-01-18
2	Ford	2022-04-01
2	Renault	2021-09-20
2	Peugeot	2001-05-02
1	Renault	2023-01-04	1
2	Mazda	2023-01-05
3	Ford	2011-03-30	3
1	Ford	2011-03-30	1

My goal is to find those rows with the same values in the 'brand' AND 'date' columns AND with a different value in 'shop_code'. When this criteria is met, a message is added in the column "revision" (eg. the name of the shop).

Note that the three rows of Renault on 2023-01-04 should be detected since one of them is from a different shop (shops 1 and 5).

Please, any ideas?

Wael · Answer 1 · 2023-03-17T13:28:55.327

Maybe with dplyr and

group_by to group brand and date
n_distinct to count nb of distinct shop_code

intermediate variables could be removed at the end of the dplyr pipe.

Here I display them for educational purposes and better generalization

df%>%
  group_by(brand,date)%>%
  mutate(n_rows=n())%>%
  mutate(distinct_shop=n_distinct(shop_code))%>%
  mutate(duplicated=n_rows>1 & distinct_shop >1 ) 

# A tibble: 14 × 7
# Groups:   brand, date [10]
   shop_code brand      date       revision n_rows distinct_shop duplicated
       <dbl> <chr>      <chr>      <chr>     <int>         <int> <lgl>     
 1         1 Renault    2023-01-04 "1"           3             2 TRUE      
 2         1 Mazda      2023-01-04 ""            1             1 FALSE     
 3         1 Peugeot    2023-01-13 ""            2             1 FALSE     
 4         5 Renault    2023-01-04 "5"           3             2 TRUE      
 5         1 Peugeot    2023-01-13 ""            2             1 FALSE     
 6         1 Volkswagen 2020-08-23 ""            1             1 FALSE     
 7         1 Renault    2023-01-18 ""            1             1 FALSE     
 8         2 Ford       2022-04-01 ""            1             1 FALSE     
 9         2 Renault    2021-09-20 ""            1             1 FALSE     
10         2 Peugeot    2001-05-02 ""            1             1 FALSE     
11         1 Renault    2023-01-04 "1"           3             2 TRUE      
12         2 Mazda      2023-01-05 ""            1             1 FALSE     
13         3 Ford       2011-03-30 "3"           2             2 TRUE      
14         1 Ford       2011-03-30 "1"           2             2 TRUE

Wael, the different shop values criteria is missing in your reply. But thank you very much because I like this approach, is perfect for me in those cases when I want to use only the criteria based on two columns. — Andres, Mar 17 '23 at 10:41
Andres, thanks for the feedback, Changes were made accordingly — Wael, Mar 17 '23 at 13:31

score 1 · Accepted Answer · answered Mar 17 '23 at 10:23

library(dplyr)

df %>% 
  group_by(brand, date) %>% 
  mutate(revision = if (n_distinct(shop_code) >= 2) shop_code else NA)

# A tibble: 14 × 4
# Groups:   brand, date [10]
   shop_code brand      date       revision
       <dbl> <chr>      <chr>         <dbl>
 1         1 Renault    2023-01-04        1
 2         1 Mazda      2023-01-04       NA
 3         1 Peugeot    2023-01-13       NA
 4         5 Renault    2023-01-04        5
 5         1 Peugeot    2023-01-13       NA
 6         1 Volkswagen 2020-08-23       NA
 7         1 Renault    2023-01-18       NA
 8         2 Ford       2022-04-01       NA
 9         2 Renault    2021-09-20       NA
10         2 Peugeot    2001-05-02       NA
11         1 Renault    2023-01-04        1
12         2 Mazda      2023-01-05       NA
13         3 Ford       2011-03-30        3
14         1 Ford       2011-03-30        1

Thank you Aurèle. Amazing! works perfectly, with just two lines. — Andres, Mar 17 '23 at 10:43

R: finding duplicated rows according to multiple columns but not removing the rows

2 Answers2