This is probably asked elsewhere , however I can't get to it with the different search phrasing I've tried... My question is about finding duplicates but not removing them. I've found a few examples on how to remove rows based on multiple columns (e.g. link1, link2) but I want to keep them in the dataset, thus the number and order of rows is not altered.
An example of my data:
shop_code <- c(1, 1, 1, 5, 1, 1, 1, 2, 2, 2, 1, 2, 3, 1)
brand <- c("Renault", "Mazda", "Peugeot", "Renault", "Peugeot", "Volkswagen", "Renault", "Ford", "Renault", "Peugeot", "Renault", "Mazda", "Ford", "Ford")
date <- c("2023-01-04", "2023-01-04", "2023-01-13", "2023-01-04", "2023-01-13", "2020-08-23", "2023-01-18", "2022-04-01", "2021-09-20", "2001-05-02", "2023-01-04", "2023-01-05", "2011-03-30", "2011-03-30")
df <- data.frame(shop_code, brand, date)
revision <- c("1", "", "", "5", "", "", "", "", "", "", "1", "", "3", "1")
df$revision <- revision
In table format:
shop_code | brand | date | revision |
---|---|---|---|
1 | Renault | 2023-01-04 | 1 |
1 | Mazda | 2023-01-04 | |
1 | Peugeot | 2023-01-13 | |
5 | Renault | 2023-01-04 | 5 |
1 | Peugeot | 2023-01-13 | |
1 | Volkswagen | 2020-08-23 | |
1 | Renault | 2023-01-18 | |
2 | Ford | 2022-04-01 | |
2 | Renault | 2021-09-20 | |
2 | Peugeot | 2001-05-02 | |
1 | Renault | 2023-01-04 | 1 |
2 | Mazda | 2023-01-05 | |
3 | Ford | 2011-03-30 | 3 |
1 | Ford | 2011-03-30 | 1 |
My goal is to find those rows with the same values in the 'brand' AND 'date' columns AND with a different value in 'shop_code'. When this criteria is met, a message is added in the column "revision" (eg. the name of the shop).
Note that the three rows of Renault on 2023-01-04 should be detected since one of them is from a different shop (shops 1 and 5).
Please, any ideas?