Improving code design of DNA alignment degapping

Question

This is a question regarding a more efficient code design:

Assume three aligned DNA sequences (seq1, seq2 and seq3; they are each strings) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the aligned DNA sequences.

# Input
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}

I wish to remove the gaps (i.e., dashes) from the alignment and maintain the relative association of the start and stop positions of the genes.

# Desired output
align = {"seq1":"ATGCATGC", 
         "seq2":"ATGC",
         "seq3":"ACAC"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,1], "gene2":[2,3]},
         "seq3":{"gene1":[0,1], "gene2":[2,3]}}

Obtaining the desired output is less trivial than it may seem. Below I wrote some (line-numbered) pseudocode for this problem, but surely there is a more elegant design.

1  measure length of any aligned gene  # take any seq, since all seqs aligned
2  list_lengths = list of gene lengths  # order is important
3  for seq in alignment
4      outseq = ""
5      for each num in range(0, length(seq))  # weird for-loop is intentional
6          if seq[num] == "-"
7              current_gene = gene whose start/stop positions include num
8              subtract 1 from length of current_gene
9              subtract 1 from lengths of all genes following current_gene in list_lengths
10         else
11             append seq[num] to outseq
12     append outseq to new variable
13     convert gene lengths into start/stop positions and append ordered to new variable

Can anyone give me suggestions/examples for a shorter, more direct code design?

Looking for a Pythonian solution, this better be tagged _python_ - I'd drop pseudocode. You already "rebased" your arrays from 1 to 0: consider representing ranges/slices including _[from, to)_ excluded. The "association" of _anno_ and _align_ via "label" looks slight. - You need to specify allowed overlap or gaps between _genes_, if any. Keeping _genes_ in _annos_ ordered should help - specify! 8&9 may be too detailed. Educated guess: depending on representation, 13 is about half of your complexity - expand. (Once you got code _and_ still see issues, consider presenting this at CODE REVIEW.) — greybeard, Jan 20 '16 at 17:06
@greybeard I changed the tags as you suggested. Changes to the pseudocode according to your suggestions (especially line 13) are forthcoming. — Michael Gruenstaeudl, Jan 20 '16 at 18:15
Just to clarify, is this what your data means? For sequence `AT----GC`, the `"gene1":[0,3], "gene2":[4,7]` indicates that gene1 is `AT--`, which can be shortened to `AT`, and gene2 is `--GC`, which can be shortened to `GC`? — Pausbrak, Jan 20 '16 at 18:41
Followup question, is the input/output format fixed or does it just need to contain the same data? It's easier to write a Pythonic solution if the format is flexible. — Pausbrak, Jan 20 '16 at 19:07
@Pausbrak Regarding your first question: Yes, your illustration of "shortening" (which bioinformaticians would call _degapping_) is correct. The crux is to have the annotations still be correct after this degapping step. — Michael Gruenstaeudl, Jan 20 '16 at 22:43
@Pausbrak Regarding your follow-up question: The output (i.e., degapped sequences and updated annotations) should again be in the form of a Python dictionary, if at all possible. — Michael Gruenstaeudl, Jan 20 '16 at 22:47
In your comment to cdlane's answer, your second `annos` -- In the case of `seq2` - `gene2` do you want there to be no record in the `annos` dictionary? — Kevin, Jan 24 '16 at 15:25

Kevin · Accepted Answer · 2016-01-25T17:06:08.463

This answer handles the updated annos dictionary from your comment on cdlane's answer. That answer leaves the annos dictionary with the incorrect indices [2,1] for seq2 gene2. My proposed solution removes a gene entry from the dictionary if the sequence consists entirely of gaps in that region. Also note that if a gene contains only one letter in the final align, then anno[geneX] will have equal indices for start and stop (see seq3 gene1 in the annos from your comment).

align = {"seq1":"ATGCATGC",
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}

annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}


annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
          "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
          "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}

import re
for name, anno in annos.items():
    # indices of the gaps in the aligned (still gapped) sequence, found using re
    removed = [m.start(0) for m in re.finditer(r'-', align[name])]

    build_dna = ''
    # walk the genes in order of their start position; sorted() returns a
    # list, so entries can be deleted from the dictionary inside the loop
    for gene, inds in sorted(anno.items(), key=lambda item: item[1][0]):

        start_ind = len(build_dna)

        # generator to sum the number of '-' removed from this gene
        num_gaps = sum(1 for i in removed if inds[0] <= i <= inds[1])

        # build the de-gapped string from the gapped alignment
        build_dna += align[name][inds[0]:inds[1]+1].replace("-", "")

        end_ind = len(build_dna) - 1

        if num_gaps == inds[1] - inds[0] + 1:  # gene is all gaps
            del annos[name][gene]  # remove the gene entry
            continue
        # update the values in the annos dictionary
        annos[name][gene][0] = start_ind
        annos[name][gene][1] = end_ind

    # remove the gaps from the align dictionary (only after the genes are done)
    align[name] = re.sub(r'-', '', align[name])

Results:

In [3]: annos
Out[3]:  {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]},
          'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
          'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}

Results for the three-gene annos3 defined above (run the same loop with annos3 in place of annos):

In [5]: annos3
Out[5]:  {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]},
          'seq2': {'gene1': [0, 1], 'gene3': [2, 3]},
          'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}}

@Kevin The automatic removal of annotations that would consist of gaps only was not a feature I had considered to begin with. Yet, such cases are not uncommon, and good code should be able to handle it. Thanks for pointing it out. — Michael Gruenstaeudl, Jan 25 '16 at 22:45

cdlane · Answer 2 · 2016-01-20T23:59:07.440

The following matches the output of the example program for both test cases:

align = {"seq1":"ATGCATGC",
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}

annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
         "seq2":{"gene1":[0,3], "gene2":[4,7]},
         "seq3":{"gene1":[0,3], "gene2":[4,7]}}

(START, STOP) = (0, 1)

for name, sequence in align.items():
    new_sequence = ''
    gap = 0  # number of gaps removed from this sequence so far

    for position, base in enumerate(sequence):
        if '-' == base:
            # position - gap is where this gap sits in the partly degapped sequence
            for gene in annos[name].values():
                if gene[START] > (position - gap):
                    gene[START] -= 1
                if gene[STOP] >= (position - gap):
                    gene[STOP] -= 1
            gap += 1
        else:
            new_sequence += base

    align[name] = new_sequence

The result of running this:

python3 -i test.py
>>> align
{'seq2': 'ATGC', 'seq1': 'ATGCATGC', 'seq3': 'ACAC'}
>>> 
>>> annos
{'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}, 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]}, 'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
>>>

I hope this is still elegant, direct, short and Pythonic enough!

@cdlane Your solution definitely appears more direct and Pythonic than my convoluted attempt. If you can fix the off-by-one error (which seems to be systematic; see below), I'd be happy to select your answer. As for the systematic error: Please test your code with the following example annos and see the incorrect lower bound for _gene3_ in _seq2_ and _seq3_: `annos = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}` — Michael Gruenstaeudl, Jan 20 '16 at 22:38
@MichaelGruenstaeudl I think I've addressed the off-by-one error and the output matches what your code generates. Enjoy! — cdlane, Jan 20 '16 at 23:55

score 0 · Answer 3 · answered Jan 20 '16 at 22:18

My own solution to the above question is neither elegant nor Pythonic, but arrives at the desired output. Any recommendations for improvement are highly welcome!

import collections
import operator
# measure length of any aligned sequence (all seqs are aligned, so any will do)
align_len = len(next(iter(align.values())))
# initialize output
align_out, annos_out = {}, {}
# loop through annos
for seqname, anno in annos.items():
    # operate on ordered gene lengths instead of on ranges
    ordseqlens = collections.OrderedDict()
    # generate ordered gene lengths (sorted by start/stop positions)
    for k, v in sorted(anno.items(), key=operator.itemgetter(1)):
        ordseqlens[k] = v[1] - v[0] + 1
    # start (and later append to) sequence output
    align_out[seqname] = ""
    # index-based (R-style) for-loop
    for pos in range(len(align[seqname])):
        if align[seqname][pos] == "-":
            try:
                current_gene = next(key for key, a in anno.items() if a[0] <= pos <= a[1])
            except StopIteration:
                print("No annotation provided for position", pos, "in sequence", seqname)
                continue  # gap lies outside every gene, so no length changes
            # subtract 1 from the length of current_gene
            ordseqlens[current_gene] -= 1
        else:
            # append nucleotide unless a gap
            align_out[seqname] += align[seqname][pos]
    # convert modified ordered gene lengths back into start/stop positions
    summ = 0
    tmp_dict = {}
    for k, v in ordseqlens.items():
        tmp_dict[k] = [summ, v + summ - 1]
        summ = v + summ
    # save start/stop positions to correct annos
    annos_out[seqname] = tmp_dict

The output of this code is:

>>> align_out
{'seq3': 'ACAC',
 'seq2': 'ATGC',
 'seq1': 'ATGCATGC'}

>>> annos_out
{'seq3': {'gene1': [0, 1], 'gene2': [2, 3]},
 'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
 'seq1': {'gene1': [0, 3], 'gene2': [4, 7]}}
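
The last step of the code above, turning the per-gene lengths back into start/stop positions, is a running sum, so it could also be sketched with itertools.accumulate. This is only an illustration: `lengths_to_ranges` is a hypothetical helper, and it assumes the genes are ordered, contiguous, and start at position 0, as in the example data.

```python
from itertools import accumulate

def lengths_to_ranges(ordered_lengths):
    """ordered_lengths: list of (gene, length) pairs in sequence order.

    Returns inclusive [start, stop] positions, assuming contiguous genes
    that begin at position 0 (an assumption, not a general rule).
    """
    genes = [gene for gene, _ in ordered_lengths]
    # running totals of the lengths are the exclusive end positions
    ends = list(accumulate(length for _, length in ordered_lengths))
    # each gene starts where the previous one ended
    starts = [0] + ends[:-1]
    return {gene: [start, end - 1] for gene, start, end in zip(genes, starts, ends)}

print(lengths_to_ranges([("gene1", 2), ("gene2", 2)]))
```

For the degapped seq2 (two genes of length 2 each) this gives [0, 1] and [2, 3], the same values as in annos_out.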

Jared Goguen · Answer 4 · 2016-01-21T18:23:57.757

So, I think that the approach of trying to break each sequence up into genes and then remove the dashes is resulting in a lot of unnecessary book-keeping. Instead, it might be easier to look at the dashes directly and then update all of the indices based on their relative positions. Here's a function I wrote that appears to be operating correctly:

from copy import deepcopy

def rewriteGenes(align, annos):
    # deep copies, so the nested index lists of the caller's annos stay untouched
    alignments = deepcopy(align)
    annotations = deepcopy(annos)

    for sequence, alignment in alignments.items():
        while alignment.find('-') > -1:
            index = alignment.find('-')
            for gene, (start, end) in annotations[sequence].items():
                if index < start: 
                    annotations[sequence][gene][0] -= 1
                if index <= end: 
                    annotations[sequence][gene][1] -= 1
            alignment = alignment[:index] + alignment[index+1:]
        alignments[sequence] = alignment

    return (alignments, annotations)

This iterates over the dashes in each alignment and updates the gene indices as they are removed.
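
If the repeated find-and-slice becomes slow on long alignments, the same index shift can be computed in one pass by counting, for every aligned position, how many non-gap characters come before it. The following is a minimal sketch under the question's convention of inclusive [start, stop] indices; the names `degap_with_prefix_counts` and `kept_before` are made up for illustration, and an all-gap gene comes out with stop == start - 1.

```python
def degap_with_prefix_counts(align, annos):
    """Return degapped copies of align and annos; the inputs are not modified."""
    new_align, new_annos = {}, {}
    for name, seq in align.items():
        # kept_before[i] = number of non-gap characters in seq[:i]
        kept_before = [0]
        for ch in seq:
            kept_before.append(kept_before[-1] + (ch != "-"))
        new_align[name] = seq.replace("-", "")
        new_annos[name] = {
            # new start: residues before the gene;
            # new stop: residues up to and including the old stop, minus one
            gene: [kept_before[start], kept_before[stop + 1] - 1]
            for gene, (start, stop) in annos[name].items()
        }
    return new_align, new_annos
```

Each sequence is scanned once, and every gene is then remapped with two list lookups, so the genes do not need to be ordered or contiguous.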

Note that this produces a gene with indices [2,1] for the following test case:

align = {"seq1":"ATGCATGC",
         "seq2":"AT----GC",
         "seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
         "seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}, 
         "seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}

This is necessary because the way your indices are set up does not otherwise allow for empty genes. For example, the indices [2,2] would be the sequence of length 1 starting at index 2.
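
One way to avoid the odd-looking [2,1], in line with greybeard's comment on the question, would be to store half-open ranges [start, stop), where an empty gene simply has start == stop. The sketch below is an illustration only: `to_half_open` is a hypothetical helper, and the `inclusive` values are the seq2 result for the test case above.

```python
def to_half_open(annos):
    """Convert inclusive [start, stop] pairs to half-open [start, stop) pairs."""
    return {
        seq: {gene: [start, stop + 1] for gene, (start, stop) in genes.items()}
        for seq, genes in annos.items()
    }

# Degapped seq2 from the test case, with the inclusive [2, 1] entry for gene2.
degapped = {"seq2": "ATGC"}
inclusive = {"seq2": {"gene1": [0, 1], "gene2": [2, 1], "gene3": [2, 3]}}

half_open = to_half_open(inclusive)
for gene, (start, stop) in half_open["seq2"].items():
    # An empty gene has start == stop and slices to the empty string.
    print(gene, [start, stop], repr(degapped["seq2"][start:stop]))
```

With this representation gene2 becomes [2, 2], its length is stop - start = 0, and the ranges can be passed to Python slicing without the +1 adjustment.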
