1

I have two dataframes and I'm trying to merge them based on MarkerName, Allele1, and Allele2. My issue is that the allele 1 and allele 2 columns can sometimes be flipped and are different cases (i.e. line 1 in both examples below). Additionally, sometimes the MarkerName can be duplicated (line 3 and 4).

Here are the examples:

 MarkerName  id     Allele1 Allele2
1: 1:752721  rs3131972  A  G
2: 1:791853  rs6684487  G  A
3: 1:834056 rs28482280  A  C
4: 1:834056 rs28482282  A  CAC
5: 1:834059 rs28482242  A  C,G


and

   MarkerName Allele1 Allele2  Effect 
1:   1:752721       g       a -0.0018 
2:   1:791853       g       a  0.0408 
3:   1:834056       a       c  0.0079 

Is there a way to add the id column to the second dataframe in R? So it would look something like this in the end:

   MarkerName Allele1 Allele2  Effect id
1:   1:752721       g       a -0.0018 rs3131972
2:   1:791853       g       a  0.0408 rs6684487
3:   1:834056       a       c  0.0079 rs28482280



statssyboy
  • 89
  • 6
  • You will want to use dplyr, [here](https://stackoverflow.com/questions/26611717/can-dplyr-join-on-multiple-columns-or-composite-key#26611970) is a link that joins using dplyr based on multiple columns – Daniel_j_iii Jun 20 '20 at 03:00
  • Outputs look to be `data.table` ( not `data.frame`). – Parfait Jun 20 '20 at 03:29

1 Answers1

2

You can sort the data in Allele1 and Allele2 using pmin and pmax, use separate_rows to get data in different rows and join them.

library(dplyr)

df1 %>%
   tidyr::separate_rows(Allele2) %>%
   mutate(col1 = tolower(pmin(Allele1, Allele2)), 
          col2 = tolower(pmax(Allele1, Allele2))) %>%
   right_join(df2 %>%
               mutate(col1 = tolower(pmin(Allele1, Allele2)), 
                      col2 = tolower(pmax(Allele1, Allele2))),  
                by = c('MarkerName', 'col1', 'col2')) %>%
    select(MarkerName, id, Allele1 = Allele1.y, Allele2 = Allele2.y, Effect)


#  MarkerName         id Allele1 Allele2  Effect
#1   1:752721  rs3131972       g       a -0.0018
#2   1:791853  rs6684487       g       a  0.0408
#3   1:834056 rs28482280       a       c  0.0079
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thank you! The issue though is that it rearranges the order of Allele 1 and 2, which means I would also have to flip the sign for the Effect column for those that have been flipped. Is there anyway to merge without having to change the order? – statssyboy Jun 20 '20 at 03:48
  • @Parfait That is because in `mutate` when you run `Allele1 = tolower(pmin(Allele1, Allele2))` the value of `Allele1` is changed in the next line `Allele2 = tolower(pmax(Allele1, Allele2))`. That is not the case in `transform`. Although as OP wants to maintain the original order I have created new columns with `mutate` which does not mess up `Allele1` and `Allele2`. – Ronak Shah Jun 20 '20 at 04:19