I am absolutely lost as to how to filter duplicates based on the value of more than one string variable. Sadly, my dataset is private, but I can offer a glimpse of it with fake data:
id = c(1, 1, 2, 2, 5, 6, 6)
car = c(0, 1, 1, 1, 1, 1, 1)
insurance = c("no", "yes", "yes", "yes", "no", "yes", "yes")
ins_type = c("", "liab", "liab", "full", "", "full", "liab")
df = data.frame(id, car, insurance, ins_type)
Which builds this data.frame:
id car insurance ins_type
1 0 no
1 1 yes liab
2 1 yes liab
2 1 yes full
5 1 no
6 1 yes full
6 1 yes liab
where:
a. id = person
b. car = 0 is NO and 1 is YES
c. insurance = whether or not that person has one, and
d. ins_type = liability or full
I need to remove all duplicate individuals, keeping exactly one row per person. In order of preference, I want to keep:
- People who appear only once in the dataset, regardless of owning a car; then, among duplicates,
- The row where the person owns a car, then preferably the row where they
- Have insurance, then preferably the row where they
- Have FULL insurance.
That is:
id car insurance ins_type
1 1 yes liab
2 1 yes full
5 1 no
6 1 yes full
Notice that 5 has to stay, because it appears only once. All duplicates were removed. Person #1 has two rows, but only one of them involves owning a vehicle, so that one was kept.
I have the following dplyr code:
library(dplyr)
df = df %>%
  group_by(id) %>%
  filter(car == 1) %>%
  filter(insurance == "yes") %>%
  filter(ins_type == "full")
But that results in:
id car insurance ins_type
2 1 yes full
6 1 yes full
I have also tried
df %>% group_by(id, car) %>% distinct(insurance)
but that results in
id car insurance
1 0 no
1 1 yes
2 1 yes
5 1 no
6 1 yes
The first line should not be there.
I have searched this topic extensively and found a number of answers for the question "how to conditionally filter duplicate rows." Most of them -- such as this and this -- deal with keeping one of the rows with either the highest or lowest value. Others deal with arbitrary/random filtering. I need to follow the logic above.
Any insights are very welcome.
EDIT
All the answers below are highly satisfactory and solved the problem in their own way. I've voted for @storaged's answer because the heart of the solution for my problem was to use factor levels so as to create a hierarchy. I appreciate your help and teachings, and hope I can be of help to you or the community one day.
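For future readers, here is a minimal sketch of the factor-level idea as I understand it. It is not necessarily @storaged's exact code. The helper columns insurance_f and ins_type_f are names I made up for illustration, and the order of the levels is my own encoding of the preferences listed in the question (car owners first, then "yes" before "no", then "full" before "liab" before blank).

```r
library(dplyr)

# Fake data from the question
df <- data.frame(
  id        = c(1, 1, 2, 2, 5, 6, 6),
  car       = c(0, 1, 1, 1, 1, 1, 1),
  insurance = c("no", "yes", "yes", "yes", "no", "yes", "yes"),
  ins_type  = c("", "liab", "liab", "full", "", "full", "liab")
)

result <- df %>%
  mutate(
    # Assumption: the level order encodes the hierarchy, most preferred first
    insurance_f = factor(insurance, levels = c("yes", "no")),
    ins_type_f  = factor(ins_type, levels = c("full", "liab", ""))
  ) %>%
  # Factors sort by level order, so the preferred row comes first within each id
  arrange(id, desc(car), insurance_f, ins_type_f) %>%
  # Keep only the first (most preferred) row per person
  distinct(id, .keep_all = TRUE) %>%
  # Drop the illustrative helper columns
  select(-insurance_f, -ins_type_f)

print(result)
```

On the fake data this should leave one row per id (1, 2, 5 and 6), with person 5 kept even though they have no insurance, because sorting only decides which row wins within an id and never drops an id entirely.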