2

I am absolutely lost as to how to filter duplicates based on the value of more than one string variable. Sadly, my dataset is private, but I can offer a glimpse of it with fake data:

id = c(1, 1, 2, 2, 5, 6, 6)
car = c(0, 1, 1, 1, 1, 1, 1) 
insurance = c("no", "yes", "yes", "yes", "no", "yes", "yes")
ins_type = c("", "liab", "liab", "full", "", "full", "liab")
df = data.frame(id, car, insurance, ins_type)`

Which buils this data.frame:

id car insurance ins_type`
 1   0        no
 1   1       yes     liab
 2   1       yes     liab
 2   1       yes     full
 5   1        no 
 6   1       yes     full
 6   1       yes     liab

where:

a. id = person
b. car = 0 is NO and 1 is YES
c. insurance = whether or not that person has one, and  
d, ins_type = liability or full

I need to remove all duplicate individuals. My desired dataset is people who:

  1. Appear once in the dataset, regardles of owning a car, then;
  2. People who own a car, then preferably those who;
  3. Have insurance, then preferably those who;
  4. Have FULL insurance.

That is:

id car insurance ins_type
 1   1       yes     liab
 2   1       yes     full
 5   1        no 
 6   1       yes     full

Notice that 5 has to stay, for it appears only once. All duplicates were removed. Person #1 has two connections, but only one based on owning a vehicle, so that was kept.

I have the following dplyr code:

df = df %>%
    group_by(id) %>%
    filter(car == 1) %>%
    filter(insurance == "yes") %>%
    filter(ins_type == "full")

But that results in:

id   car insurance ins_type
 2      1       yes     full
 6      1       yes     full

I have also tried

df %>% group_by(id, car) %>% distinct(insurance)

but that results in

id   car insurance
 1     0        no
 1     1       yes
 2     1       yes
 5     1        no
 6     1       yes

The first line should not be there.

I have searched this topic extensively and found a number of answers for the question "how to conditionally filter duplicate rows." Most of them -- such as this and this -- deal with keeping one of the rows with either the highest ir lowest value. Others deal with arbitrary/random filtering. I need to follow the logic above.

Any insights are very welcome.

EDIT

All the answers below are highly satisfactory and solved the problem in their own way. I've voted for @storaged 's one because the heart of the solution for my problem was to use factor levels so as to create a hierarchy. I appreaciate your help and teachings, and hope I can be of help to you or the community one day.

  • Can you add what's your desired output? – sm925 Jan 09 '18 at 15:20
  • 4th box is the desired output, I think – storaged Jan 09 '18 at 15:22
  • The one with ids :- 1, 2, 5, 6? – sm925 Jan 09 '18 at 15:31
  • @suchait: Yes, it's the 4th box. It keeps: a. id=5, who appears once; b. the second entry of id=1, which is the one with the car; c. the entries of id=2 and id=6 that are cars with full insurance. Thanks. – questionMarc Jan 09 '18 at 15:31
  • You want a solution in dplyr only or it would be fine to use `data.table`? – sm925 Jan 09 '18 at 15:32
  • I'm still learning data.table, and understand very little of it. But yes, a DT solution would be fine. However, I asked the question specifically on DPLYR, so I guess it would be fair to mark that one as a solution. Would you agree with that? – questionMarc Jan 09 '18 at 15:39
  • I have added one in `data.table`. I guess you have got one in `dplyr` too. – sm925 Jan 09 '18 at 15:44

3 Answers3

2

I propose the following solution. First take care of the importance of each fields by providing proper gradution. In your example we do

df$ins_type <- factor(df$ins_type, levels=c("", "liab", "full"))

other factors have good order of their levels. Next we can sort all the fields and select last entry in the group

df %>% group_by(id) 
   %>% arrange(sort(car), sort(insurance), sort(ins_type)) 
   %>% do(tail(.,n=1))

However it feels like there might exist more elegant solution

EDIT

If there are much more column names, instead of writing them by hand you can do the following

df %>% group_by(id)
   %>% arrange_(.dots=paste0("sort(", names(df)[-1],")")) 
   %>% do(tail(.,n=1))
storaged
  • 1,837
  • 20
  • 34
  • Thanks, @storaged. I will test and post back. – questionMarc Jan 09 '18 at 15:42
  • 1
    Hi, your code works fine, @storaged. Now I need to test it with the actual dataset. There more variables and more levels it. Thanks for the edit too, it will help on that matter. Please allow me some time to post back. – questionMarc Jan 09 '18 at 16:29
  • 1
    Hi, I've employed your method with my actual data -- which actually had more than 10 categories -- and it worked perfectly. Thanks a million! – questionMarc Jan 20 '18 at 17:10
1

Using data.table:-

library(data.table)
setDT(df)
df[, idx := .N, by = id]
df <- df[!(idx == 2 & car == 0), ]
df[, idx := .N, by = id]
df <- df[!(idx == 2 & ins_type == "liab"), ]
df[, idx := NULL]
df

You'll get your desired output:-

id car insurance ins_type
1:  1   1       yes     liab
2:  2   1       yes     full
3:  5   1        no         
4:  6   1       yes     full

Here's something which I tried in dplyr:-

df <- df %>%
  group_by(id) %>%
  mutate(idx = n()) %>%
  filter((idx == 2 | idx == 1) & car == 1) %>%
  mutate(idx1 = n())


df %>%
  filter(!(idx1 == 2 & ins_type == "liab")) %>%
  select(-one_of(c("idx", "idx1")))

It gives the same output:-

 # A tibble: 4 x 4
# Groups:   id [4]
     id   car insurance ins_type
  <dbl> <dbl>    <fctr>   <fctr>
1     1     1       yes     liab
2     2     1       yes     full
3     5     1        no         
4     6     1       yes     full
sm925
  • 2,648
  • 1
  • 16
  • 28
  • Thank you very much, @suchait. I will test and post back. – questionMarc Jan 09 '18 at 15:45
  • Well, @suchait, your code works fine. If I understood it correctly (sorry, beginner DT here...), you have: 1. created an idx variable with the number of occurrences for each ID; 2. removed people showing twice IF they had NO car in one entry; 3. create idx again for further duplicates; 4. removed people with 2 cars if one was "liab"; 5. removed the IDX variable. Is that correct? – questionMarc Jan 09 '18 at 16:08
  • Well, thanks so much. I will test it for the cases where a person has no car but shows once in the dataset. Please allow me a little time (and for other answers to come through) before I pick a solution. But I appreciate your help and teaching me some DT. :) – questionMarc Jan 09 '18 at 16:16
  • Sure. Check out `dplyr` solution too. – sm925 Jan 09 '18 at 16:17
  • thanks a lot for your help and for the data.table code and lesson. I've decided for storaged's solution and explained it above. But yours worked fine and was very appreciated. – questionMarc Jan 20 '18 at 17:12
1

This is an extension on @storaged answer but all in dplyr chain

df %>% 
   mutate(ins_type = relevel(ins_type, "liab")) %>% 
   group_by(id) %>% 
   arrange(car, insurance, ins_type) %>%      # sort and arrange are redundant
   slice(n())    # equivalent to do(tail(., 1))
CPak
  • 13,260
  • 3
  • 30
  • 48
  • Thanks, @Cpak. I'll test and post back. Can `relevel` be used to order all levels -- or more than one level at a time? For instance: `relevel(ins_type, "liab", "full")`, so that I get "1 liab", "2 full" and "3 NA"? From the documentation I don't see that, and using `ifelse` didn't work either. – questionMarc Jan 09 '18 at 20:38
  • It doesn't seem like it. From docs, `relevel(x, ref, …)`, `**ref** the reference level, typically a string.`, (but I haven't tested it). – CPak Jan 10 '18 at 15:39
  • Dear @Cpak, thanks for your help and neat code. I voted for storaged's solution because I had to employ his method for setting up the levels. But your solution was very elegant and worked for the proposed question flawlessly. Your help was much appreciated. – questionMarc Jan 20 '18 at 17:15