Unique case of finding duplicate values flexibly across columns in R

Question

I have a dataset similar to the following:

df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"))

> df
  animal_1 predation_type animal_2
1      cat           eats    mouse
2      dog           eats squirrel
3    mouse       eaten by      cat
4 squirrel           eats     nuts

I am looking for code that identifies rows 1 and 3 as duplicates, since they show the same phenomenon (a cat eating a mouse, or a mouse being eaten by a cat). I'm not sure how to even describe the kind of duplicate I'm looking for, so I'm hoping someone can help. I've tried combining the text into one column (i.e., "catmouse", "dogsquirrel", etc.) and then inverting the letters, but that quickly proved too complex.

Thanks so much for any help you can provide.

`t(apply(df[ , c(1, 3)], 1, sort))` as described e.g. here: [Select equivalent rows A-B & B-A](https://stackoverflow.com/questions/19647875/select-equivalent-rows-a-b-b-a); [Removing duplicate combinations irrespective of order](https://stackoverflow.com/questions/9028369/removing-duplicate-combinations-irrespective-of-order). Also `pmin(df$animal_1, df$animal_2)`; `pmax(df$animal_1, df$animal_2)` — Henrik, Jan 17 '22 at 05:02
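
A minimal base R sketch of the `pmin()`/`pmax()` idea from the comment above, assuming both animal columns are character vectors (the default for `data.frame()` in R >= 4.0). The names `pair_key` and `is_dup` are illustrative and do not come from the thread.

```r
df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"),
                 stringsAsFactors = FALSE)

# Order-independent key: the alphabetically first and last animal of each row
pair_key <- data.frame(first  = pmin(df$animal_1, df$animal_2),
                       second = pmax(df$animal_1, df$animal_2))

# Flag every row whose unordered pair occurs more than once;
# scanning from both ends marks the first occurrence as well as the repeats
is_dup <- duplicated(pair_key) | duplicated(pair_key, fromLast = TRUE)
is_dup
#> [1]  TRUE FALSE  TRUE FALSE

# Keep only the first row of each unordered pair
df[!duplicated(pair_key), ]
```

Because the key ignores `predation_type`, this treats any two rows with the same pair of animals as duplicates, whatever the direction of predation.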

Yuriy Saraykin · Accepted Answer · 2022-01-17T18:35:19.513

1

tidyverse

df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"))
library(tidyverse)

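df %>%
  rowwise() %>%
  # Build an order-independent key from columns 1 and 3; the "|" separator
  # keeps pairs such as ("ab", "c") and ("a", "bc") from colliding
  mutate(duplicates = str_c(sort(c_across(c(1, 3))), collapse = "|")) %>%
  group_by(duplicates) %>%
  # Replace the key with a flag: TRUE if the pair occurs more than once
  mutate(duplicates = n() > 1) %>%
  ungroup()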
#> # A tibble: 4 x 4
#>   animal_1 predation_type animal_2 duplicates
#>   <chr>    <chr>          <chr>    <lgl>     
#> 1 cat      eats           mouse    TRUE      
#> 2 dog      eats           squirrel FALSE     
#> 3 mouse    eaten by       cat      TRUE      
#> 4 squirrel eats           nuts     FALSE

Created on 2022-01-17 by the reprex package (v2.0.1)

removing duplicates


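library(tidyverse)
# Keep the first row of each unordered animal pair; map2_chr() returns a
# character vector of keys, one per row, for duplicated() to compare
df %>%
  filter(!duplicated(map2_chr(animal_1, animal_2,
                              ~str_c(sort(c(.x, .y)), collapse = "|"))))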
#>   animal_1 predation_type animal_2
#> 1      cat           eats    mouse
#> 2      dog           eats squirrel
#> 3 squirrel           eats     nuts

Created on 2022-01-17 by the reprex package (v2.0.1)

edited Jan 17 '22 at 18:35

answered Jan 17 '22 at 10:30

Yuriy Saraykin


Hi Yuriy, thanks so much, your first solution worked perfectly. And I appreciate learning how to do this in Tidyverse. Now that I have identified the duplicates, what would be the best way to eliminate one of each duplicate from the dataframe? – Bradley Allf Jan 17 '22 at 16:20
Ah, wait I think I figured it out with: df %>% rowwise() %>% mutate(duplicates = str_c(sort(c_across(c(1, 3))), collapse = "")) %>% ungroup() %>% group_by(duplicates) %>% filter(row_number() == 1) %>% mutate(duplicates = n() > 1) %>% ungroup() – Bradley Allf Jan 17 '22 at 17:42
Great! If you believe this answer was helpful for you, you could accept it by clicking the tick on the left side of this answer :) – Yuriy Saraykin Jan 17 '22 at 18:16
Ah, great! Done. Very new to this. Thanks! – Bradley Allf Jan 19 '22 at 04:40
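
For the follow-up in the comments (flag the duplicates, then keep one row of each), here is a hedged dplyr sketch. It computes the flag before dropping rows, because a flag computed after filtering to one row per group is always FALSE. The column names `pair_key`, `n_pair` and `had_duplicate` are illustrative assumptions, not part of the original answer.

```r
library(dplyr)

df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"),
                 stringsAsFactors = FALSE)

result <- df %>%
  # Order-independent key built from the two animal columns
  mutate(pair_key = paste(pmin(animal_1, animal_2),
                          pmax(animal_1, animal_2), sep = "|")) %>%
  # Count how often each key occurs while all rows are still present
  add_count(pair_key, name = "n_pair") %>%
  mutate(had_duplicate = n_pair > 1) %>%
  # Keep the first row of each key, retaining the other columns
  distinct(pair_key, .keep_all = TRUE) %>%
  select(-pair_key, -n_pair)

result
```

The surviving "cat eats mouse" row keeps `had_duplicate = TRUE`, so the result still records which pairs originally appeared more than once.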

Macgregor Aubertin-Young · Answer 2 · 2022-01-17T05:27:17.827

0

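You can sort() the two animal columns within each row so that reversed pairs become identical, which makes duplicated() useful.

# Keep only the two animal columns (assumed to be character, the default in R >= 4.0)
newdf = df[, c('animal_1', 'animal_2')]

# Sort each pair alphabetically so that "mouse, cat" becomes "cat, mouse"
for (i in seq_len(nrow(newdf))){
  newdf[i, ] = sort(c(newdf$animal_1[i], newdf$animal_2[i]))
}

# Drop rows whose sorted pair has already appeared
newdf[!duplicated(newdf), ]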

  animal_1 animal_2
1      cat    mouse
2      dog squirrel
4     nuts squirrel

edited Jan 17 '22 at 05:27

answered Jan 17 '22 at 05:02

Macgregor Aubertin-Young


Thanks for your help! I tried this but was getting warning messages and the output had three null values. – Bradley Allf Jan 17 '22 at 16:23
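
One way to avoid row-by-row assignment altogether is to sort the pairs with `apply()` and use the result only as a lookup for `duplicated()`, which also keeps `predation_type` in the output. This is a sketch that assumes character columns; `pair_cols`, `sorted_pairs` and `deduped` are illustrative names.

```r
df <- data.frame(animal_1 = c("cat", "dog", "mouse", "squirrel"),
                 predation_type = c("eats", "eats", "eaten by", "eats"),
                 animal_2 = c("mouse", "squirrel", "cat", "nuts"),
                 stringsAsFactors = FALSE)

pair_cols <- c("animal_1", "animal_2")

# apply() returns each sorted pair as a column, so transpose back
# to get one row per original row
sorted_pairs <- t(apply(df[, pair_cols], 1, sort))

# duplicated() on a matrix compares whole rows; df itself is left unsorted
deduped <- df[!duplicated(sorted_pairs), ]
deduped
```

The same pattern extends to more than two columns by adding names to `pair_cols`.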
