This is a question regarding a more efficient code design:
Assume three aligned DNA sequences (seq1, seq2 and seq3; each is a string) that each span two genes (gene1 and gene2). The start and stop positions of these genes are known relative to the aligned sequences (0-based, with the stop position inclusive).
# Input
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}
I wish to remove the gaps (i.e., dashes) from each sequence and update the start and stop positions so that every gene still covers the same nucleotides as before.
# Desired output
align = {"seq1":"ATGCATGC",
         "seq2":"ATGC",
         "seq3":"ACAC"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,1], "gene2":[2,3]},
         "seq3":{"gene1":[0,1], "gene2":[2,3]}}
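To make the requirement concrete, here is a small sanity check in Python, offered only as a sketch. It tests that each gene selects the same nucleotides before and after gap removal. The function name annotations_consistent is hypothetical, and the check assumes that stop positions are inclusive, which is inferred from the example coordinates rather than stated anywhere. It is a necessary condition only: repeated motifs could let wrong coordinates pass.

```python
def annotations_consistent(old_seq, old_genes, new_seq, new_genes, gap="-"):
    """Return True if every gene covers the same nucleotides before and after.

    Assumption: coordinates are 0-based and the stop position is inclusive.
    """
    for gene, (start, stop) in old_genes.items():
        new_start, new_stop = new_genes[gene]
        # Nucleotides the gene covered in the aligned (gapped) sequence.
        expected = old_seq[start:stop + 1].replace(gap, "")
        # Nucleotides the gene covers in the ungapped sequence.
        if new_seq[new_start:new_stop + 1] != expected:
            return False
    return True


# seq2 from the example above, with its desired output coordinates.
print(annotations_consistent(
    "AT----GC", {"gene1": [0, 3], "gene2": [4, 7]},
    "ATGC", {"gene1": [0, 1], "gene2": [2, 3]},
))
```

For the seq2 example this prints True; leaving the old coordinates [0,3] and [4,7] on the ungapped sequence would make it return False.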
Obtaining the desired output is less trivial than it may seem. Below I wrote some (line-numbered) pseudocode for this problem, but surely there is a more elegant design.
1  measure length of any aligned gene # take any seq, since all seqs aligned
2  list_lengths = list of gene lengths # order is important
3  for seq in alignment
4      outseq = ""
5      for each num in range(0, length(seq)) # weird for-loop is intentional
6          if seq[num] == "-"
7              current_gene = gene whose start/stop positions include num
8              subtract 1 from length of current_gene
9              subtract 1 from lengths of all genes following current_gene in list_lengths
10         else
11             append seq[num] to outseq
12     append outseq to new variable
13     convert gene lengths into start/stop positions and append ordered to new variable
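For reference, the pseudocode above could be rendered in Python roughly as follows. This is a sketch, not a tested solution. The function name degap_by_lengths is hypothetical, and it makes two interpretive choices that the pseudocode does not spell out: gene lengths are recomputed for every sequence, and line 9 is realised through the running offset in the last step rather than by shortening the later genes. It also assumes that genes do not overlap, are adjacent, and start at position 0, as in the example.

```python
def degap_by_lengths(align, annos, gap="-"):
    """Remove gaps and rebuild gene coordinates from per-gene lengths."""
    new_align, new_annos = {}, {}
    for name, seq in align.items():
        # Genes in left-to-right order; the order matters for the last step.
        genes = sorted(annos[name].items(), key=lambda item: item[1][0])
        # Lines 1-2: aligned length of each gene (stop position is inclusive).
        lengths = {gene: stop - start + 1 for gene, (start, stop) in genes}
        kept = []
        for num, char in enumerate(seq):              # line 5
            if char == gap:                           # line 6
                for gene, (start, stop) in genes:     # lines 7-8
                    if start <= num <= stop:
                        lengths[gene] -= 1
            else:
                kept.append(char)                     # line 11
        new_align[name] = "".join(kept)               # line 12
        # Lines 9 and 13: a shortened gene shifts all later genes because
        # each start is the sum of the preceding (already reduced) lengths.
        # Assumption: genes are adjacent and the first one starts at 0.
        offset, shifted = 0, {}
        for gene, _ in genes:
            shifted[gene] = [offset, offset + lengths[gene] - 1]
            offset += lengths[gene]
        new_annos[name] = shifted
    return new_align, new_annos


align = {"seq1": "ATGCATGC", "seq2": "AT----GC", "seq3": "A--CA--C"}
annos = {name: {"gene1": [0, 3], "gene2": [4, 7]} for name in align}
new_align, new_annos = degap_by_lengths(align, annos)
print(new_align)
print(new_annos)
```

A gene that consists only of gaps in some sequence would end up with a stop position one below its start; how to represent that case is left open here.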
Can anyone give me suggestions/examples for a shorter, more direct code design?