0

So I have fasta file with DNA sequences and I want to do a pairwise comparison of each DNA sequence.

Fasta file contains something of this form:

>dna1
TAGTACTGACCATGGCGTTTGTTG
>dna2
ACCTTGAGATACAAAACGATTGGACTG
>dna3
GCTTCACTGATGCAGTATTCAATTAACCAG
>dna4
CCACTGGAGCTTTCCAAAGGG
>dna5
TCTGTGGGTCCGGTTGTACAG

My approach was to first create a dictionary out of the fast file of DNA sequences, and then do a pairwise comparison of values in a dictionary to find the %age identity between each pair of sequences!!

I am having trouble doing the pairwise comparison!

My code is as follows:

from collections import OrderedDict
from typing import Dict

# Convert the fasta file to dictionary
DnaName_SYMBOL = '>'
def parse_DNAsequences(filename: str,
                    ordered: bool=False) -> Dict[str, str]:
    # filename: str is the DNA sequence name
    # ordered: bool, Gives us an option to order the resulting dictionary

    result = OrderedDict() if ordered else {}

    last_name = None
    with open(filename) as sequences:
        for line in sequences:
            if line.startswith(DnaName_SYMBOL):
                last_name = line[1:-1]
                result[last_name] = []
            else:
                result[last_name].append(line[:-1])

    for name in result:
        result[name] = ''.join(result[name])

    return result
DNAdict = parse_DNAsequences('output.fas')

This part is where I am having trouble, iterating through the dictionary values:

def PairwiseComparison():
    match = sum(s1 == s2 for s1, s2 in zip(a,b))
    if len(s1) > len(s2):
        lengthchosen = len(s1)
    percentidentity = 100*match/lengthchosen

print('{} vs {} {}%').format(percentidentity)

Output should be of this form:

dna1 vs dna2 90%
dna1 vs dna3 100%
dna2 vs dna3 90%

Other notes are that if we compare 2 dna sequences and one of them has a length bigger than the other then we use that length in calculating the percentage identity between the two (where percentage identity is # of matches/the total length)

  • See https://stackoverflow.com/questions/14421733/global-and-local-variables-in-python – Michael Butscher May 10 '20 at 09:44
  • I am having some trouble understanding your question. Are you asking for a way to create all the possible combinations of the DNAs and then iterate on them? – Mike Xydas May 10 '20 at 09:48
  • @MikeXydas I am having trouble doing a pairwise comparison of each value in the dictionary, for example if my values are a,b, c then comparing a with b, a with c and b with a !! I am not sure how to iterate to do the above and store that comparison answer to later print – Japnit Sethi May 10 '20 at 09:51

1 Answers1

0

I suppose that you successfully create the dictionary of DNAs and I have hardcoded it on my example. The heavy-lifting is done by the itertools.combinations

from itertools import combinations

dna_dict = {
    'dna1': 'TAGTACTGACCATGGCGTTTGTTG',
    'dna2': 'ACCTTGAGATACAAAACGATTGGACTG',
    'dna3': 'GCTTCACTGATGCAGTATTCAATTAACCAG'
}

def PairwiseComparison(d1, d2):
    match = sum(s1 == s2 for s1, s2 in zip(d1,d2))
    if len(d1) > len(d2):
        lengthchosen = len(d1)
    else:
        lengthchosen = len(d2)
    percentidentity = 100*match/lengthchosen

    return percentidentity

# Creates all the possible combinations of the dictionary keys
dna_combinations = combinations(dna_dict, 2)

for dna1, dna2 in dna_combinations:
    percent_identity = PairwiseComparison(dna_dict[dna1], dna_dict[dna2])
    print(f'{dna1} vs {dna2} {percent_identity}%')  # The f-strings need python > 3.6, you can change the format if you have a lower version

If you need any clarifications/additions feel free to add a comment.

Mike Xydas
  • 469
  • 5
  • 12
  • Appreciate the help can you explain the `combinations(dna_dict, 2)` and the `print(f'{dna1} vs {dna2} {percent_identity}%')` -> did not understand f here – Japnit Sethi May 10 '20 at 10:10
  • `combinations(dna_dict, 2)` creates all the possible combinations of size 2 of the keys of the `dna_dict`. It returns `[('dna1', 'dna2'), ('dna1', 'dna3'), ('dna2', 'dna3')]`. I suggest reading the documentation link that I have on my answer and its examples. The `f` on the print creates an f-string. Reading the examples here will help you understand it: https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python – Mike Xydas May 10 '20 at 10:16