9

I'm trying to implement the Smith-Waterman algorithm for local sequence alignment using the affine gap penalty function. I think I understand how to initiate and compute the matrices required for calculating alignment scores, but am clueless as to how to then traceback to find the alignment. To generate the 3 matrices required I have the following code

for j in range(1, len2):
    for i in range(1, len1):
        fxOpen = F[i][j-1] + gap
        xExtend = Ix[i][j-1] + extend
        Ix[i][j] = max(fxOpen, xExtend)

        fyOpen = F[i-1][j] + gap
        yExtend = Iy[i-1][j] + extend
        Iy[i][j] = max(fyOpen, yExtend)

        matchScore = (F[i-1][j-1]  + simMatrixDict[seq1[i-1]+seq2[j-1]])
        xScore = Ix[i-1][j-1] + simMatrixDict[seq1[i-1]+seq2[j-1]]
        yScore = Iy[i-1][j-1] + simMatrixDict[seq1[i-1]+seq2[j-1]]
        F[i][j] = max(0, matchScore, xScore, yScore)

I am unsure if I need a single matrix for traceback, or just 1? Any clarification on how to go about tracing back from the max score in F would be much appreciated.

jonwells
  • 213
  • 2
  • 7
  • Are you trying to implement the algorithm just as an exercise? You can find Python implementations online. Examples: [one](https://github.com/alevchuk/pairwise-alignment-in-python), [two](https://pypi.python.org/pypi/swalign/0.2), [three](https://github.com/kevinakwok/bioinfo/tree/master/Smith-Waterman), [four](http://forrestbao.blogspot.com/2007/09/smith-waterman-algorithm-in-process.html). – David Cain Aug 07 '13 at 14:41
  • 1
    thanks for the reply but only one of those (two) includes the affine gap penalty function, which is what i'm really after. Unfortunately the code in that is a bit beyond me, only been at it for a couple of months. – jonwells Aug 08 '13 at 09:12

1 Answers1

7

The important thing to remember about traceback in Smith-Waterman is that the matrix a value is in determines the direction that you move. So, if you are in Fyou're moving diagonally, if you're in Ix, you're moving horizontally, and if you're in Iy, you're moving vertically. This means that all you need to store in the pointer matrix is the matrix that you arrived at a square from. The matrix that you are coming from, not the one that you are going to, determines the dirction which to go.

For example:

Say you are at F[5][5]:

  • If pointer matrix says to go to Ix, go to Ix[4][4]
  • If pointer matrix says to go to Iy, go to Iy[4][4]
  • If pointer matrix says to go to F, go to F[4][4]

Whereas if you are at Ix[5][5]:

  • If pointer matrix says to go to Ix, go to Ix[4][5]
  • If pointer matrix says to go to F, go to F[4][5]

Or if you are at Iy[5][5]:

  • If pointer matrix says to go to Iy, go to Iy[5][4]
  • If pointer matrix says to go to F, go to F[5][4]

Assuming that the first index is the x coordinate and the second is the y coordinate.

Continue tracing back until you reach a cell with a maximum value of 0.

Building the pointer matrix: You need one pointer matrix each for F, Ix, and Iy. These matrices only need to indicate which matrix a value came from, because that tells you which direction you were moving in. So, as you're running through the dynamic programming phase of the algorithm, you should also be building up the pointer matrices. Every time you store a new maximum value in a cell in F, Ix, or Iy, you should update the corresponding matrix to indicate where it came from. If, for instance, the highest value you can have in F[5][5] comes by aligning the two next bases when you're in F[4][4], the Fpointer[5][5] should be set to F, because you got there from the F matrix.

Ivan Kush
  • 2,958
  • 2
  • 23
  • 39
seaotternerd
  • 6,298
  • 2
  • 47
  • 58
  • thanks for the quick reply, what I'm struggling with though is how to build the pointer matrix. It seems that the three score matrices are built independently of one another, so I can't see how you would go about deciding when to move from one to another? presumably you would need pointed for left, up, diagonal, then additional pointers telling you which matrix to move to? – jonwells Aug 08 '13 at 09:14
  • 1
    Okay, I edited my answer to include more information about that. Basically, you do need a different pointer matrix for each of your three matrices, but it only needs to record the matrix that you came from when you got the highest value in that cell, because that tells you everything you need to know about movement direction. Since you're asking about traceback, I'm assuming you already have dynamic programming working, so that you can find the best possible valuein each cell. Setting up the pointer matrix is just a matter of keeping track of how you got that value. – seaotternerd Aug 08 '13 at 15:12
  • 1
    I am still in doubt here. If you have the time, could you please show, even if in pseudocode, why three matrices are necessary? The way I thought this was like this: the traceback would simply store directions. I don't really understand why we need to jump to other matrices while tracebacking. When in DP, we store the direction that this value came from so we follow it back (DIAG, LEFT or UP). if the max value of x,y came from F, it's DIAG, if from Ix, LEFT, and so on. I am not saying this is right - I am just confused :) How do I save where i came from and where I am? – francisaugusto Mar 16 '17 at 18:51
  • 1
    The above suggestions really helped me and it works perfectly! Thanks a lot :-) – Srivathsa Nov 08 '17 at 18:11
  • This solution does not guarantee an optimal solution. The optimum path that passed through Ix[5][5] did not necessarily need to pass through either Ix[4][5] or F[4][5]. It can simply come from the Ix[n][5] where n <=4. So you need either an auxiliary structure to store the gap length in the forward pass. A space-efficient solution uses 7 bit arrays [Altschul et al.]. https://www.researchgate.net/publication/19580571_Optimal_sequence_alignment_using_affine_gap_costs – Ruolin Liu Sep 27 '20 at 14:20