0

This is probably asked elsewhere , however I can't get to it with the different search phrasing I've tried... My question is about finding duplicates but not removing them. I've found a few examples on how to remove rows based on multiple columns (e.g. link1, link2) but I want to keep them in the dataset, thus the number and order of rows is not altered.

An example of my data:

shop_code <- c(1, 1, 1, 5, 1, 1, 1, 2, 2, 2, 1, 2, 3, 1)
brand <- c("Renault", "Mazda", "Peugeot", "Renault", "Peugeot", "Volkswagen", "Renault", "Ford", "Renault", "Peugeot", "Renault", "Mazda", "Ford", "Ford")
date <- c("2023-01-04", "2023-01-04", "2023-01-13", "2023-01-04", "2023-01-13", "2020-08-23", "2023-01-18", "2022-04-01", "2021-09-20", "2001-05-02", "2023-01-04", "2023-01-05", "2011-03-30", "2011-03-30")
df <- data.frame(shop_code, brand, date)
revision <- c("1", "", "", "5", "", "", "", "", "", "", "1", "", "3", "1")
df$revision <- revision

In table format:

shop_code brand date revision
1 Renault 2023-01-04 1
1 Mazda 2023-01-04
1 Peugeot 2023-01-13
5 Renault 2023-01-04 5
1 Peugeot 2023-01-13
1 Volkswagen 2020-08-23
1 Renault 2023-01-18
2 Ford 2022-04-01
2 Renault 2021-09-20
2 Peugeot 2001-05-02
1 Renault 2023-01-04 1
2 Mazda 2023-01-05
3 Ford 2011-03-30 3
1 Ford 2011-03-30 1

My goal is to find those rows with the same values in the 'brand' AND 'date' columns AND with a different value in 'shop_code'. When this criteria is met, a message is added in the column "revision" (eg. the name of the shop).

Note that the three rows of Renault on 2023-01-04 should be detected since one of them is from a different shop (shops 1 and 5).

Please, any ideas?

Andres
  • 59
  • 6

2 Answers2

1

Maybe with dplyr and

  1. group_by to group brand and date
  2. n_distinct to count nb of distinct shop_code

intermediate variables could be removed at the end of the dplyr pipe.

Here I display them for educational purposes and better generalization

df%>%
  group_by(brand,date)%>%
  mutate(n_rows=n())%>%
  mutate(distinct_shop=n_distinct(shop_code))%>%
  mutate(duplicated=n_rows>1 & distinct_shop >1 ) 

# A tibble: 14 × 7
# Groups:   brand, date [10]
   shop_code brand      date       revision n_rows distinct_shop duplicated
       <dbl> <chr>      <chr>      <chr>     <int>         <int> <lgl>     
 1         1 Renault    2023-01-04 "1"           3             2 TRUE      
 2         1 Mazda      2023-01-04 ""            1             1 FALSE     
 3         1 Peugeot    2023-01-13 ""            2             1 FALSE     
 4         5 Renault    2023-01-04 "5"           3             2 TRUE      
 5         1 Peugeot    2023-01-13 ""            2             1 FALSE     
 6         1 Volkswagen 2020-08-23 ""            1             1 FALSE     
 7         1 Renault    2023-01-18 ""            1             1 FALSE     
 8         2 Ford       2022-04-01 ""            1             1 FALSE     
 9         2 Renault    2021-09-20 ""            1             1 FALSE     
10         2 Peugeot    2001-05-02 ""            1             1 FALSE     
11         1 Renault    2023-01-04 "1"           3             2 TRUE      
12         2 Mazda      2023-01-05 ""            1             1 FALSE     
13         3 Ford       2011-03-30 "3"           2             2 TRUE      
14         1 Ford       2011-03-30 "1"           2             2 TRUE      
Wael
  • 1,640
  • 1
  • 9
  • 20
  • 1
    Wael, the different shop values criteria is missing in your reply. But thank you very much because I like this approach, is perfect for me in those cases when I want to use only the criteria based on two columns. – Andres Mar 17 '23 at 10:41
  • Andres, thanks for the feedback, Changes were made accordingly – Wael Mar 17 '23 at 13:31
1
library(dplyr)

df %>% 
  group_by(brand, date) %>% 
  mutate(revision = if (n_distinct(shop_code) >= 2) shop_code else NA)
# A tibble: 14 × 4
# Groups:   brand, date [10]
   shop_code brand      date       revision
       <dbl> <chr>      <chr>         <dbl>
 1         1 Renault    2023-01-04        1
 2         1 Mazda      2023-01-04       NA
 3         1 Peugeot    2023-01-13       NA
 4         5 Renault    2023-01-04        5
 5         1 Peugeot    2023-01-13       NA
 6         1 Volkswagen 2020-08-23       NA
 7         1 Renault    2023-01-18       NA
 8         2 Ford       2022-04-01       NA
 9         2 Renault    2021-09-20       NA
10         2 Peugeot    2001-05-02       NA
11         1 Renault    2023-01-04        1
12         2 Mazda      2023-01-05       NA
13         3 Ford       2011-03-30        3
14         1 Ford       2011-03-30        1
Aurèle
  • 12,545
  • 1
  • 31
  • 49