1

I'm working on a script that currently contains multiple lists of DNA sequences (each list has a varying number of DNA sequences) and I need to cluster the sequences in each list based on Hamming Distance similarity. My current implementation of this (very crude at the moment) extracts the first sequence in the list and calculates the Hamming Distance of each subsequent sequence. If it's within a certain Hamming Distance, it appends it to a new list which later is used to remove sequences from the original list as well as store the similar sequences in a defaultdict. See my current implementation of my code below:

def hamming_dist(sequence1, sequence2):
"""
Calculates the hamming distance between 2 sequences
"""
    assert len(sequence1) == len(sequence2)
    return sum(sequence1 !=sequence2 for sequence1,sequence2 \
    in itertools.izip(sequence1,sequence2))


def group_sequences(sequences_list):
    trash_sequences = []
    main_sequence = sequences_list[0]
    clustered_sequence = defaultdict(list)
    while len(sequences_list) > 1:
        for sequence in sequences_list:
            ham_dist = hamming_dist(main_sequence,sequence)
            if hamming_dist < 30:
                trash_sequences.append(sequence)

        for similar_sequences in trash_sequences:
            sequences_list.remove(similar_sequences)
            clustered_sequence[main_tcr].append(similar_sequences)
    else:
        clustered_sequence[main_sequence].append(None)
    return clustered_sequence

Obviously this isn't the best or most efficient way to do it and even with this implementation, there are still some bugs in my script that I encountered. I read through over StackOverflow/StackExchange questions to see if other people have encountered my problem and of the similar questions I found, many other people have mentioned about using algorithms such as the K-Means algorithm, Markov Clustering method, hierarchy clustering, etc. I'm not too familiar with any of these methods except the K-means method which requires numbers, not strings.

Which clustering method(s) would you suggest I implement to cluster similar DNA sequences together as well as the best way to implement your preferred method of choice? Any suggestions would be much appreciated!

John Collins
  • 2,067
  • 9
  • 17
superasiantomtom95
  • 521
  • 1
  • 7
  • 25

2 Answers2

0

The best choice to get started is hierarchical clustering.

It's easy to understand, it allows any distance, and it can visualize the clustering as a tree even when the data itself is hard to visualize.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
0

I would recommend using the Levenshtein[1, 2] distance metric instead, which is identical to Hamming except that Hamming can only compare two strings of the same length, while Levenshtein distance is not limited by that constraint and thus allows for calculating distances between any collection of sequences (strings).

There is already an excellent Python library which uses C to implement an ultra fast algorithm for calculating Levenshtein distances.

Example bioinformatics multi-sequence set clustering analysis:

  1. Use Levenshtein optimal distance scores as similarity metric. Calculate pairwise among input set of sequences.
  2. AP (Affinity Propagation) clustering inherently determines the most representative number of clusters for given input sequences by arriving at iteratively-optimized "exemplar" representative sequences (one per cluster), using the calculated Levenshtein distances.
  3. Clustering can then be productively interpreted through MSA (Multiple Sequence Alignment; e.g., via clustalo) on each subset of obtained clusters of sequences.
  4. And visualized by hierarchically-clustered heatmaps and Multi-Dimensional Scaling plots

E.g.,:

In these example data visualizations, the sequences are actually protein - but that makes no difference. The input just as well could have been nucleic acid sequences. (The sequences are also intentionally obscured or only partially provided.)

Hierarchical Clustering

Hierarchical Clustermap

Multidimensional Scaling Visualization of Affinity Propagation-obtained Sequence Clusters

MDS Plot

John Collins
  • 2,067
  • 9
  • 17
  • @superasiantomtom95 If you are still interested in this question, I could provide some example Python code demonstrating the outline I've suggested. – John Collins Jul 22 '23 at 16:12