Questions tagged [sequence-alignment]

A type problem in which two or more sequences need to be lined up with each other, generally for the purposes of identifying similarities between them. These problems are common in bioinformatics, but the algorithms used to solve them are just as relevant to aligning other types of sequences, such as text strings. A variety of algorithms have been developed for dealing with various sub-sets of this problem.

Sequence alignment problems are a group of problems in which you have two or more sequences, generally with some potentially similar portions, that you want to line up so that the similar portions of each are associated. This is often an important component of calculating the similarity of the sequences.

Sequence alignment is frequently important in , in which sequences of DNA, RNA, or amino acids must be aligned in order to infer what mutations occurred where and when. However, sequence alignment problems occur in all domains in which there are sequences, such as in text matching.

Dynamic programming is the most commonly used technique for aligning two sequences (for multiple sequence alignment, see below). Starting from the first element of each string, each pair of elements is either aligned (if they match) or dealt with with one of the operators described below (if they don't match). For more detail, see this page.

The exact dynamic programming algorithm used depends on the specific problem at hand. Here are the two most common:

  • Needleman-Wunsch - Global alignment of two sequences (i.e. all letters in both sequences need to be used)
  • Smith-Waterman - Local alignment of two sequences (i.e. only a subsequence of each string needs to be used)

Three main operations are generally allowed in sequence alignment. It's easiest to think of these operations as things that might have happened to one of the sequences to turn it into the other sequence:

  • Insertion: An element is inserted into one of the sequences. This is generally represented by adding a gap to the opposite sequence.
  • Deletion: An element is removed from one of the sequences. This is generally inserted by adding a gap to that sequence.
  • Mutation/Substitution: An element is replaced with a different element.

Each of these operations have a cost associated with them to reflect how likely it was that they would have happened to the original sequence. Mutation/Substitution generally has a different cost for different substitutions (often generated from a BLOSUM or PAM matrix in bioinformatics). Insertion and deletion accounted for with some sort of gap penalty. In simple implementations, this penalty is often a constant cost per gap, but in bioinformatics an affine gap penalty is often more appropriate.

Multiple Sequence Alignment: Dynamic programming quickly becomes computationally intractable expensive as the number of sequences being aligned increases. For this reason, multiple sequence alignment algorithms generally do not guarantee optimality. A variety of techniques are used:

  • Progressive alignment: In this technique, a series of pairwise alignments are used to create an overall multi-way alignment. Often the order in which these alignments are performed is determined by a hierarchical clustering algorithm like neighbor-joining or UPGMA. A number of tools exist for performing such alignments in bioinformatics, such as the Clustal family and T-Coffee.
  • Heuristic approaches: A wide variety of heuristics can be used for very large scale multiple sequence alignment. Blast is by far the most popular tool for this in the case of bioinformatics.
  • Hidden-Markov Models: HMMs can be used to find the most likely alignments for a set of sequences. HMMER is a popular bioinformatics tool for this approach.
131 questions
15
votes
2 answers

Why is the Vision framework unable to align two images?

I'm trying to take two images using the camera, and align them using the iOS Vision framework: func align(firstImage: CIImage, secondImage: CIImage) { let request = VNTranslationalImageRegistrationRequest( targetedCIImage: firstImage) { …
9
votes
0 answers

How to align long texts?

I want to align a pair of long texts with ~20M chars each. I've used in the past Smith-Waterman algorithm but (from my limited understanding) it requires creating a 2-dimensional array with the size of the texts (20M by 20M array) - which is not…
Omri
  • 887
  • 8
  • 24
9
votes
1 answer

Traceback in Smith-Wateman algorithm with affine gap penalty

I'm trying to implement the Smith-Waterman algorithm for local sequence alignment using the affine gap penalty function. I think I understand how to initiate and compute the matrices required for calculating alignment scores, but am clueless as to…
jonwells
  • 213
  • 2
  • 7
7
votes
1 answer

Implementing the Waterman-Eggert algorithm

I am trying to implement the Waterman-Eggert algorithm for finding sub-optimal local sequence alignments, but am struggling to understand how to "declump" separate groups of alignments. I have the basic Smith-Waterman algorithm working fine. A…
jonwells
  • 213
  • 2
  • 7
6
votes
1 answer

How to merge strings with overlapping characters in python?

I'm working on a python project which reads in an URL encoded overlapping list of strings. Each string is 15 characters long and overlaps with its sequential string by at least 3 characters and at most 15 characters (identical). The goal of the…
5
votes
1 answer

Minimal bitstring set union with shifts algorithm

I'm looking for an algorithm to solve, or at least a proper name for the following problem: I have a set B of bitstrings. The algorithm should find a minimal (defined as "having fewest bits set") bitstring S such that: For all b in B, there exists…
AlliedEnvy
  • 376
  • 1
  • 5
5
votes
1 answer

R EnvironmentError: Could not find Ghostscript on path. RWebLogo

I ran into an odd issue that I cannot fix in any way and I was hoping someone here may have a better understanding of whats wrong; I am unable to use RWebLogo package - even run the simplest examples due to the same missing Ghostscript error. e.g.…
alexo4
  • 351
  • 3
  • 8
5
votes
3 answers

How to handle multiple optimal edit paths implementing Needleman-Wunsche algorithm?

Trying to implement Needleman-Wunsche algorithm for biological sequences comparison. In some circumstances there exist multiple optimal edit paths. What is the common practice in bio-seq-compare tools handling this? Any priority/preferences among…
Frozen Flame
  • 3,135
  • 2
  • 23
  • 35
4
votes
0 answers

string similarity of optimal alignment

Expected Behaviour of the algorithm I have two strings a and b, with a being the shorter string. I would like to find the substring of b, that has the biggest similarity to a. The substring has to be of len(a), or has to be placed at the end of…
4
votes
2 answers

Sequence Alignment Algorithm with a group of characters instead of one character

Summary: I'm beginning with some details about alignment algorithms, and at the end, I ask my question. If you know about alignment algorithm pass the beginning. Consider we have two strings like: ACCGAATCGA ACCGGTATTAAC There is some algorithms…
mostafa8026
  • 273
  • 2
  • 12
  • 25
4
votes
1 answer

Multiple Sequence Alignment with Unequal String Length

I need a methodology for creating a consensus sequence out of 3 - 1000 short (10-20bp) nucleotide ("ATCG") reads of varying lengths. A simplified example: "AGGGGC" "AGGGC" "AGGGGGC" "AGGAGC" "AGGGGG" Should result in a consensus sequence of…
4
votes
2 answers

Stacking boxes algorithm

I'm a little stuck trying to solve this problem. The main source of confusion comes from not knowing when to remove a box. Here's my approach: I look at the container column by column. If the top most box of the origin box is empty and the…
4
votes
1 answer

Determining ideal number of clusters for sequence(distance)- based clustering

I have written these functions for clustering sequence-based data: library(TraMineR) library(cluster) clustering <- function(data){ data <- seqdef(data, left = "DEL", gaps = "DEL", right = "DEL") couts <- seqsubm(data, method = "CONSTANT") …
histelheim
  • 4,938
  • 6
  • 33
  • 63
4
votes
2 answers

global sequence alignment dynamic programming finding the minimum in a matrix

I have 2 sequences, AACAGTTACC and TAAGGTCA, and I'm trying to find a global sequence alignment. I managed to create a 2D array and create the matrix, and I even filled it with semi-dynamic approach. Here is my code to fill the matrix: void…
judge
  • 195
  • 1
  • 4
  • 14
4
votes
2 answers

global alignment using gap penalty function

can any body help me with the following problem?! For a parameter k, compute the global alignment between two strings, subject to the constraint that the alignment contains at most k gaps (blocks of consecutive indels).
arman imen
  • 41
  • 1
1
2 3
8 9