2

I have a huge data set with repeated information in a variable when this relation is understood in sense ID1, ID2 or ID2, ID1 (those in different columns), also values in the variable can be repeated from different IDs relations. I want to delete same info with reverse ID order like this:

ID1         ID2         value
BHFUD13XG   KLSIENAL1   0.3475
GLADKK123   FBNAFLEL2   0.2956
KLSIENAL1   BHFUD13XG   0.3475
FBNAFLEL2   GLADKK123   0.2956
DGFEORM23   SHDKG14NV   0.3475
SHDKG14NV   DGFEORM23   0.3475

Clean

ID1         ID2         value
BHFUD13XG   KLSIENAL1   0.3475
GLADKK123   FBNAFLEL2   0.2956
DGFEORM23   SHDKG14NV   0.3475

Thanks for your help!

AndrewGB
  • 16,126
  • 5
  • 18
  • 49
  • Do not post your data with an image. – Park Dec 20 '21 at 00:40
  • 1
    We can't copy/paste data from images. Please leave the useful edit as it was, with data in plain text format. – neilfws Dec 20 '21 at 00:41
  • this somehow looks to me that there might be a data preparation step before which might be done differently and more efficiently. ? – tjebo Dec 20 '21 at 01:01

2 Answers2

6

As you said, if data is huge, it's not a good idea to use rowwise, but you may try

library(dplyr)
df %>%
  rowwise %>%
  mutate(key = paste0(sort(c(ID1,ID2)), collapse = "-")) %>%
  select(value, key) %>%
  distinct() %>%
  separate(key, c("ID1", "ID2"), "-")

  value ID1       ID2      
  <dbl> <chr>     <chr>    
1 0.348 BHFUD13XG KLSIENAL1
2 0.296 FBNAFLEL2 GLADKK123
3 0.348 DGFEORM23 SHDKG14NV

or

df %>%
  mutate(id1 = pmin(ID1, ID2), id2 = pmax(ID1, ID2)) %>%
  select(id1, id2, value) %>%
  distinct()

        id1       id2  value
1 BHFUD13XG KLSIENAL1 0.3475
2 FBNAFLEL2 GLADKK123 0.2956
3 DGFEORM23 SHDKG14NV 0.3475

add

df %>%
  mutate(id1 = pmin(ID1, ID2), id2 = pmax(ID1, ID2)) %>%
  distinct(id1, id2, value, .keep_all = T) %>%
  select(-id1, -id2) # and select whatever you want.

        ID1       ID2  value       id1       id2
1 BHFUD13XG KLSIENAL1 0.3475 BHFUD13XG KLSIENAL1
2 GLADKK123 FBNAFLEL2 0.2956 FBNAFLEL2 GLADKK123
3 DGFEORM23 SHDKG14NV 0.3475 DGFEORM23 SHDKG14NV
Park
  • 14,771
  • 6
  • 10
  • 29
  • Thanks, that works great! :) Is possible in the case that I want to keep additional columns information in my data, but delete the rows with the same logic (value, ID1, and ID2) as you propose? – portillanath Dec 20 '21 at 01:07
  • @portillanath add `.keep_all = T` argument then select whatever you want in `select(...)` part. Add `-` in front of column name means remove that column. – Park Dec 20 '21 at 01:14
1
library(dplyr)
library(purrr)

t <- tibble(
  ID1=c('BHFUD13XG','GLADKK123','KLSIENAL1','FBNAFLEL2','DGFEORM23','SHDKG14NV'),
  ID2=c('KLSIENAL1','FBNAFLEL2','BHFUD13XG','GLADKK123','SHDKG14NV','DGFEORM23'),
  value=c(.3475, .2956, .3475, .2956, .3475, .3475)
)

t

  ID1       ID2       value
  <chr>     <chr>     <dbl>
1 BHFUD13XG KLSIENAL1 0.3475
2 GLADKK123 FBNAFLEL2 0.2956
3 KLSIENAL1 BHFUD13XG 0.3475
4 FBNAFLEL2 GLADKK123 0.2956
5 DGFEORM23 SHDKG14NV 0.3475
6 SHDKG14NV DGFEORM23 0.3475

MakeKeys <- function(x) {
  # Replace `utf8ToInt` with appropriate encoding
  m <- map(x, function(s){sum(utf8ToInt(s))})
  flatten_int(m)
}

t <- t %>% mutate(k=MakeKeys(ID1)+MakeKeys(ID2)) %>% 
  distinct(k, .keep_all=TRUE) %>%
  select(-k)

t

  ID1       ID2       value
  <chr>     <chr>     <dbl>
1 BHFUD13XG KLSIENAL1 0.3475
2 GLADKK123 FBNAFLEL2 0.2956
3 DGFEORM23 SHDKG14NV 0.3475