Before posting my question, I would just like to mention that I have looked through the "Similar questions" tab, and have not quite found what I am looking for. I found something somewhat similar here, but it is in python. There was also a nice idea here that may help as last resort. In any case, I would like to try first if there is a more straightforward way to do it.
To the problem:
Say there are 2 different data frames: (1) Ref_seq
; and (2) Variants
:
>Ref_seq
Seq_name AA_seq
1 Ref1 VSASTQASRQKKMQEISSLVKYFIKCANRRAPRLKCQ
2 Ref2 SNFPHLVLEKILVSLTMKNCKAAMNFFQ
3 Ref3 RRQKRPSSGTIFNDAFWLDLNYLEVAKVAQS
4 Ref4 HCTSVSKEVEGTSYHESLYNALQSLRDR
5 Ref5 DHTGEYGNLVTIQSFKAEFRLAGGVNLPKIIDC
6 Ref6 HKDQMVDIMRASQDNPQDGIMVKLVVNLLQLS
7 Ref7 SNILLKDILSVRKYWCEISQQQWLELFSVY
8 Ref8 LTIFLKTLAVNFRIRVCELGDEILPTLLYIWT
9 Ref9 EDQSSMNLFNDYPDSSVSDANEPGESQSTIG
10 Ref10 SLSEKSKEETGISLQDLLLEIYRSIGEPDSL
>Variants
peptideID AA_seq
1 Pep1 QEISALVKYF
2 Pep2 HTGERGNLVT
3 Pep3 NKMTTSVLIK
4 Pep4 SMNLKNDYPD
5 Pep5 NEPGYSQSTI
6 Pep6 NPQDVIMVKL
7 Pep7 MAAKFNKMTL
8 Pep8 RRQKDPSSGT
9 Pep9 QQQWTELFSV
The first data frame contains the amino acid (aa) sequences from a reference organism, whilst the second contains the aa sequences from a test organism. It is known that the sequences from the Variants
object contain at least (a) one aa change, (b) 4 matching characters to the reference sequence from Ref_seq
, and (c) the matching can be forward or backwards (e.g. aa sequence from line 3 of Variants
).
I am trying to find a way to lookup and retrieve to which reference sequence (Seq_name) each peptideID belongs to. The result should look like this:
peptideID AA_seq Seq_name
1 Pep1 QEISALVKYF Ref1
2 Pep2 HTGERGNLVT Ref5
3 Pep3 NKMTTSVLIK Ref2
4 Pep4 SMNLKNDYPD Ref9
5 Pep5 NEPGYSQSTI Ref9
6 Pep6 NPQDVIMVKL Ref6
7 Pep7 MAAKFNKMTL Ref2
8 Pep8 RRQKDPSSGT Ref3
9 Pep9 QQQWTELFSV Ref7
I thought that maybe regex
coupled with a loop for each peptideID, considering that the strings change according to it. But I cannot wrap my head around it.
Any help will be very welcome!
Data from the example:
Ref_seq <- data.frame(Seq_name=paste0("Ref",1:10), AA_seq=c("VSASTQASRQKKMQEISSLVKYFIKCANRRAPRLKCQ", "SNFPHLVLEKILVSLTMKNCKAAMNFFQ", "RRQKRPSSGTIFNDAFWLDLNYLEVAKVAQS", "HCTSVSKEVEGTSYHESLYNALQSLRDR", "DHTGEYGNLVTIQSFKAEFRLAGGVNLPKIIDC", "HKDQMVDIMRASQDNPQDGIMVKLVVNLLQLS", "SNILLKDILSVRKYWCEISQQQWLELFSVY", "LTIFLKTLAVNFRIRVCELGDEILPTLLYIWT", "EDQSSMNLFNDYPDSSVSDANEPGESQSTIG", "SLSEKSKEETGISLQDLLLEIYRSIGEPDSL"))
Variants <- data.frame(peptideID=paste0("Pep",1:9), AA_seq=c("QEISALVKYF", "HTGERGNLVT", "NKMTTSVLIK", "SMNLKNDYPD", "NEPGYSQSTI", "NPQDVIMVKL", "MAAKFNKMTL", "RRQKDPSSGT", "QQQWTELFSV"))