1

I want to identify the position and amino acid change in a specific region of interest in my sequences, and store that information in a table.

Is it possible to do something like this in R, by maybe using the bioconductor package? I have managed to do a simple alignment of the sequences with AlignSeqs() from the DECIPHER package, but I am unable to pull out the differences in the sequences automatically. I start with FASTA files.

I want to end up with something like this:

Isolate ID    Reference_AA   Sample_AA   Pos
1             S              T           254
2             T              D           200
3             L              A           230

I have a reference sequence which is 74 AA long, and I want to look at differences in the query sequences (which are much longer than the reference) compared to the reference, and list the results in a table. The position column is related to the position in the reference sequence and not in the query sequence. I want the first AA in the reference sequence to start at 68, not 1.

I find it difficult to add example sequences for this as they tend to be very long, but here is something shorter to work with (not related to the table above):

>ref
VGRALPDVR

>query1
KSSYLDYAMSVIVGTALPDVRDGLKPVHRRVLY

>query2
ELKSSYLDYAMSVIVGRAAPDVRDGLKPV

Expected output:

ID       Reference_AA    Sample_AA   Pos
query1   R               T           70
query2   A               L           72
Haakonkas
  • 961
  • 9
  • 26
  • Can you explain your example, what is ref, query1 and 2? – zx8754 Jan 23 '18 at 13:26
  • @zx8754 Query 1 and 2 are sequences that i want to compare to the reference sequence ref. The differences in the query sequences compared to ref is what i want to list – Haakonkas Jan 23 '18 at 13:32
  • What is the expected output for query1 and 2? – zx8754 Jan 23 '18 at 13:32
  • This question is more strongly related to biological sequence analysis than to programming; I recommend that you post/transfer it to https://bioinformatics.stackexchange.com/ or ask it on https://biostars.org instead (but not both). – Charles Plessy Jan 23 '18 at 13:33
  • @CharlesPlessy Thank you for pointing that out. I will post the question there. – Haakonkas Jan 23 '18 at 13:40
  • Why position is 70 and 72? – zx8754 Jan 23 '18 at 13:41
  • 1
    @zx8754 because the "V" at the start of ref is position 68, which i mentioned in the post. However, the positioning isn't that important, as it can be fixed with Pos + 68 (if "V" is 1) later on – Haakonkas Jan 23 '18 at 14:17

1 Answers1

4

Since you want the position in the reference, you can just use a series of pair-wise alignments to the reference sequence. biostrings includes a mismatchTable function that will give you data frame with all the info you need. With a bit of reformatting using dplyr:

library(Biostrings)
library(dplyr)

seqs<-readAAStringSet("test.fa")

mismatches <- function(query, ref) {
    pairwiseAlignment(ref, query, substitutionMatrix = "BLOSUM50",
                      gapOpening = 3, gapExtension = 1) %>%
      mismatchTable() %>%
      mutate(ID=names(query), 
             Pos=PatternStart+67, 
             Reference_AA=as.character(PatternSubstring),
             Sample_AA=as.character(SubjectSubstring)) %>% 
             select(ID, Reference_AA, Sample_AA, Pos)
}  

bind_rows(mismatches(seqs[2], seqs[1]), mismatches(seqs[2], seqs[1]))

#>      ID Reference_AA Sample_AA Pos
#>1 query1            R         T  70
#>2 query2            L         A  72

EDIT

Here's how to loop over inputs using lapply:

bind_rows(lapply(seq_along(seqs[-1]), function(i) mismatches(seqs[i+1], seqs[1])))
#>   ID Reference_AA Sample_AA Pos
#>1 ref            R         T  70
#>2 ref            L         A  72
heathobrien
  • 1,027
  • 7
  • 11
  • This is great! Is it possible to run a FASTA file with multiple sequences easier than repeating the mismatches function for each sequence? – Haakonkas Jan 24 '18 at 14:08
  • I think ```mismatchTable``` only works on pairwise alignments, but you can use ```lapply``` to loop over all your sequences – heathobrien Jan 24 '18 at 14:27
  • I am getting the following error when i loop over e entries in the fasta file: "Column 'ID' is of unsupported type NULL". Any suggestions? @heathobrien – Haakonkas Feb 01 '18 at 10:41
  • This is the problem explained [here](https://stackoverflow.com/questions/9950144/access-lapply-index-names-inside-fun). I don't know why it worked for me when I tested it last week. I've updated my answer with the answer from the link – heathobrien Feb 01 '18 at 11:55