I have a data frame like this one:

df2 <- data.frame(
  chr     = c("Chr1", "Chr1", "Chr1", "Chr1", "Chr1"),
  start   = c(303259, 303259, 141256011, 143116722, 141256011),
  end     = c(11385251, 10779165, 141618035, 156328057, 156328057),
  chr.2   = c("Chr1", "Chr1", "Chr1", "Chr1", "Chr1"),
  start.2 = c(303259, 303259, 141256011, 141256011, 143116722),
  end.2   = c(10779165, 11385251, 156328057, 156328057, 156328057)
)

The table looks like this:

   chr     start       end chr.2   start.2     end.2
1 Chr1    303259  11385251  Chr1    303259  10779165
2 Chr1    303259  10779165  Chr1    303259  11385251
3 Chr1 141256011 141618035  Chr1 141256011 156328057
4 Chr1 143116722 156328057  Chr1 141256011 156328057
5 Chr1 141256011 156328057  Chr1 143116722 156328057

As you can see, rows 1 and 2 in this example contain the same pair of intervals, but with the two intervals swapped between the first and second set of columns. I would like to keep only one of those rows. The same applies to rows 4 and 5. Also, if there happen to be any exactly duplicated rows, I would like to remove them too.

I would like to obtain something like this:

   chr     start       end chr.2   start.2     end.2
1 Chr1    303259  11385251  Chr1    303259  10779165
3 Chr1 141256011 141618035  Chr1 141256011 156328057
4 Chr1 143116722 156328057  Chr1 141256011 156328057

Do you know how I could achieve this?

Eric González
  • Use `pmin()` and `pmax()` to put the columns in a standard order (maybe create new columns `start.min` and `start.max`, etc), then de-duplicate based on the columns that are in a consistent order. – Gregor Thomas Dec 13 '22 at 03:04
  • Looks to me something went wrong while [`merge`](https://stackoverflow.com/q/1299871/6574038)ing. Looks like an [XY_problem](https://en.wikipedia.org/wiki/XY_problem) to me. – jay.sf Dec 13 '22 at 04:45
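
The `pmin()`/`pmax()` idea from the first comment can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the key column names (`start.min`, `start.max`, `end.min`, `end.max`) are hypothetical, and the sketch ignores the `chr` columns because they are all `"Chr1"` in the example.

```r
# Example data from the question
df2 <- data.frame(
  chr     = rep("Chr1", 5),
  start   = c(303259, 303259, 141256011, 143116722, 141256011),
  end     = c(11385251, 10779165, 141618035, 156328057, 156328057),
  chr.2   = rep("Chr1", 5),
  start.2 = c(303259, 303259, 141256011, 141256011, 143116722),
  end.2   = c(10779165, 11385251, 156328057, 156328057, 156328057)
)

# Order-independent keys: element-wise min and max of each column pair.
# The key column names are hypothetical, chosen for this sketch only.
keys <- data.frame(
  start.min = pmin(df2$start, df2$start.2),
  start.max = pmax(df2$start, df2$start.2),
  end.min   = pmin(df2$end, df2$end.2),
  end.max   = pmax(df2$end, df2$end.2)
)

# Keep the first row of every group of rows that share the same keys
result <- df2[!duplicated(keys), ]
result
```

Because `duplicated()` flags every occurrence after the first, exact duplicate rows are dropped by the same step, and the original row names (1, 3, 4) are preserved.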

1 Answer

Use purrr::map2() to create list-columns containing sorted vectors of “starts” and “ends” for each row, then use dplyr::distinct() to remove duplicates:

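library(purrr)
library(dplyr)

df2 %>%
  mutate(
    # one sorted length-2 vector per row, so the column order no longer matters
    starts = map2(start, start.2, ~ sort(c(.x, .y))),
    ends = map2(end, end.2, ~ sort(c(.x, .y)))
  ) %>%
  # keep the first row for each distinct (starts, ends) combination
  distinct(starts, ends, .keep_all = TRUE) %>%
  # drop the helper list-columns again
  select(!starts:ends)

   chr     start       end chr.2   start.2     end.2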
1 Chr1    303259  11385251  Chr1    303259  10779165
2 Chr1 141256011 141618035  Chr1 141256011 156328057
3 Chr1 143116722 156328057  Chr1 141256011 156328057
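
One point worth checking against your real data: sorting the starts and the ends independently also treats rows as duplicates when only the ends are exchanged between the two intervals, and it does not look at the chromosome columns. If that matters, a possible variant (a sketch, not part of the approach above) is to build one key per whole interval and sort the two keys within each row. The names `key1`, `key2`, and `pair_key` are hypothetical.

```r
# Example data from the question
df2 <- data.frame(
  chr     = rep("Chr1", 5),
  start   = c(303259, 303259, 141256011, 143116722, 141256011),
  end     = c(11385251, 10779165, 141618035, 156328057, 156328057),
  chr.2   = rep("Chr1", 5),
  start.2 = c(303259, 303259, 141256011, 141256011, 143116722),
  end.2   = c(10779165, 11385251, 156328057, 156328057, 156328057)
)

# One string key per interval, including the chromosome
key1 <- paste(df2$chr, df2$start, df2$end, sep = ":")
key2 <- paste(df2$chr.2, df2$start.2, df2$end.2, sep = ":")

# pmin()/pmax() also work on character vectors, so the two keys of a row
# end up in a fixed order no matter which interval was listed first
pair_key <- paste(pmin(key1, key2), pmax(key1, key2), sep = "|")

# Keep the first row for each unordered pair of intervals
result2 <- df2[!duplicated(pair_key), ]
result2
```

On the example data this keeps the same three rows (1, 3, and 4) as the desired output in the question.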
zephryl