I need a methodology for creating a consensus sequence out of 3 - 1000 short (10-20bp) nucleotide ("ATCG") reads of varying lengths.
A simplified example:
"AGGGGC"
"AGGGC"
"AGGGGGC"
"AGGAGC"
"AGGGGG"
Should result in a consensus sequence of "AGGGGC"
.
I have found modules that do multiple sequence alignment (MSA) in the BioPython library, but only for sequences of the same length. I am also familiar with (and have implemented) Smith-Waterman style alignment for two sequences of any length. I imagine there must be a library or implementation that combine these elements (MSA over unequal lentghs), but after hours of scouring the web and various documentation haven't found anything.
Any advice on existing modules/libraries (Python preferred) or programs I can incorporate into a pipeline that do this?
Thanks!