4

I need a methodology for creating a consensus sequence out of 3 - 1000 short (10-20bp) nucleotide ("ATCG") reads of varying lengths.

A simplified example:

"AGGGGC"
"AGGGC"
"AGGGGGC"
"AGGAGC"
"AGGGGG"

Should result in a consensus sequence of "AGGGGC".

I have found modules that do multiple sequence alignment (MSA) in the BioPython library, but only for sequences of the same length. I am also familiar with (and have implemented) Smith-Waterman style alignment for two sequences of any length. I imagine there must be a library or implementation that combine these elements (MSA over unequal lentghs), but after hours of scouring the web and various documentation haven't found anything.

Any advice on existing modules/libraries (Python preferred) or programs I can incorporate into a pipeline that do this?

Thanks!

jasonscript
  • 6,039
  • 3
  • 28
  • 43

1 Answers1

1

Add gap characters at the end of your sequences so that they all have the same length. Then you can process them with your program of choice, e.g. MUSCLE:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align.Applications import MuscleCommandline

sequences = ["AGGGGC",
             "AGGGC",
             "AGGGGGC",
             "AGGAGC",
             "AGGGGG"]

longest_length = max(len(s) for s in sequences)
padded_sequences = [s.ljust(longest_length, '-') for s in sequences]
records = (SeqRecord(Seq(s)) for s in padded_sequences)

SeqIO.write(records, "msa_example.fasta", "fasta")

from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="msa_example.fasta", out="msa.txt")
print cline
BioGeek
  • 21,897
  • 23
  • 83
  • 145
  • 1
    Muscle works fine with sequences of unequal length. No need to pad them with gaps – heathobrien Jul 01 '15 at 09:05
  • MAFFT is faster than muscle but either should be fine with the OP's input data size – dkatzel Jul 01 '15 at 15:16
  • Thanks, both MAFFT and Muscle meet my need. However, they both produce unexpected results: 1 aggggc- 4 aggagc- 3 agggggc 5 aggggg- 2 a-gggc- * *.* I'm thinking the gap cost is too high, and so it's not creating the "-" in 1, 4, 5, and 2 to align the "c" nucleotides in a column. I see in MAFFT there are various -op flags to change this weight, any suggestion of which is correct or a starting value for tinkering? As far as MUSCLE goes, I don't see any options to change this weight? – Alexander West Jul 01 '15 at 22:22
  • Played around with the MAFFT settings. I can't seem to find any where it will align the "A"s and "C"s and treat the variation in number of "G"s as indels. – Alexander West Jul 01 '15 at 23:06