4

Summary:

I'm beginning with some details about alignment algorithms, and at the end, I ask my question. If you know about alignment algorithm pass the beginning.

Consider we have two strings like:

ACCGAATCGA
ACCGGTATTAAC

There is some algorithms like: Smith-Waterman Or Needleman–Wunsch, that align this two sequence and create a matrix. take a look at the result in the following section:

Smith-Waterman Matrix
§   §   A   C   C   G   A   A   T   C   G   A   
§   0   0   0   0   0   0   0   0   0   0   0   
A   0   4   0   0   0   4   4   0   0   0   4   
C   0   0   13  9   4   0   4   3   9   4   0   
C   0   0   9   22  17  12  7   3   12  7   4   
G   0   0   4   17  28  23  18  13  8   18  13  
G   0   0   0   12  23  28  23  18  13  14  18  
T   0   0   0   7   18  23  28  28  23  18  14  
A   0   4   0   2   13  22  27  28  28  23  22  
T   0   0   3   0   8   17  22  32  27  26  23  
T   0   0   0   2   3   12  17  27  31  26  26  
A   0   4   0   0   2   7   16  22  27  31  30  
A   0   4   4   0   0   6   11  17  22  27  35  
C   0   0   13  13  8   3   6   12  26  22  30  

Optimal Alignments
A   C   C   G   A   -   A   T   C   G   A   
A   C   C   G   G   A   A   T   T   A   A   

Question:

My question is simple, but maybe the answer is not easy as it looks. I want to use a group of character as a single one like: [A0][C0][A1][B1]. But in these algorithms, we have to use individual characters. How can we achieve that?

P.S. Consider we have this sequence: #read #write #add #write. Then I convert this to something like that: #read to A .... #write to B.... #add to C. Then my sequence become to: ABCB. But I have a lot of different words that start with #. And the ASCII table is not enough to convert all of them. Then I need more characters. the only way is to use something like [A0] ... [Z9] for each word. OR to use numbers.

P.S: some sample code for Smith-Waterman is exist in this link

P.S: there is another post that want something like that, but what I want is different. In this question, we have a group of character that begins with a [ and ends with ]. And no need to use semantic like ee is equal to i.

mostafa8026
  • 273
  • 2
  • 12
  • 25
  • I find it hard to understand the problem. Maybe you can try to describe why you want to introduce these grouped characters. – cel Apr 13 '16 at 11:24
  • Almost all of the sequence alignment tools in the market are focused on biological sequences (nucleotides or peptides). In my case, however, sequences are composed of hundreds of distinct elements and they cannot be encoded as ASCII strings. Thanks to [these question](http://stackoverflow.com/questions/11984100/non-biological-sequence-alignment-tool?lq=1) :) – mostafa8026 Apr 13 '16 at 11:28
  • these algorithms are not limited to ascii strings. You can use them with arbitrary alphabets. You just have represent your input sequence in an suitable alphabet. – cel Apr 13 '16 at 11:30
  • Assume we have this sequence: #read #write #add. Then I convert this to something like that: #read to A .... #write to B.... #add to C. Then my sequence become to: ABC. but I have a lot of different word that start with `#`. And the ASCII table is not enough to convert all of them. then I need more characters. the only way is to use something like `[A0] ... [Z9]` for each word. OR to use numbers. – mostafa8026 Apr 13 '16 at 11:36

2 Answers2

2

I adapted this Python implementation (GPL version 3 licensed) of both the Smith-Waterman and the Needleman-Wunsch algorithms to support sequences with multiple character groups:

#This software is a free software. Thus, it is licensed under GNU General Public License.
#Python implementation to Smith-Waterman Algorithm for Homework 1 of Bioinformatics class.
#Forrest Bao, Sept. 26 <http://fsbao.net> <forrest.bao aT gmail.com>

# zeros() was origianlly from NumPy.
# This version is implemented by alevchuk 2011-04-10
def zeros(shape):
    retval = []
    for x in range(shape[0]):
        retval.append([])
        for y in range(shape[1]):
            retval[-1].append(0)
    return retval

match_award      = 10
mismatch_penalty = -5
gap_penalty      = -5 # both for opening and extanding
gap = '----' # should be as long as your group of characters
space = '    ' # should be as long as your group of characters

def match_score(alpha, beta):
    if alpha == beta:
        return match_award
    elif alpha == gap or beta == gap:
        return gap_penalty
    else:
        return mismatch_penalty

def finalize(align1, align2):
    align1 = align1[::-1]    #reverse sequence 1
    align2 = align2[::-1]    #reverse sequence 2

    i,j = 0,0

    #calcuate identity, score and aligned sequeces
    symbol = []
    found = 0
    score = 0
    identity = 0
    for i in range(0,len(align1)):
        # if two AAs are the same, then output the letter
        if align1[i] == align2[i]:                
            symbol.append(align1[i])
            identity = identity + 1
            score += match_score(align1[i], align2[i])

        # if they are not identical and none of them is gap
        elif align1[i] != align2[i] and align1[i] != gap and align2[i] != gap:
            score += match_score(align1[i], align2[i])
            symbol.append(space)
            found = 0

        #if one of them is a gap, output a space
        elif align1[i] == gap or align2[i] == gap:
            symbol.append(space)
            score += gap_penalty

    identity = float(identity) / len(align1) * 100

    print 'Identity =', "%3.3f" % identity, 'percent'
    print 'Score =', score
    print ''.join(align1)
    # print ''.join(symbol)
    print ''.join(align2)


def needle(seq1, seq2):
    m, n = len(seq1), len(seq2)  # length of two sequences

    # Generate DP table and traceback path pointer matrix
    score = zeros((m+1, n+1))      # the DP table

    # Calculate DP table
    for i in range(0, m + 1):
        score[i][0] = gap_penalty * i
    for j in range(0, n + 1):
        score[0][j] = gap_penalty * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = score[i - 1][j - 1] + match_score(seq1[i-1], seq2[j-1])
            delete = score[i - 1][j] + gap_penalty
            insert = score[i][j - 1] + gap_penalty
            score[i][j] = max(match, delete, insert)

    # Traceback and compute the alignment 
    align1, align2 = [], []
    i,j = m,n # start from the bottom right cell
    while i > 0 and j > 0: # end toching the top or the left edge
        score_current = score[i][j]
        score_diagonal = score[i-1][j-1]
        score_up = score[i][j-1]
        score_left = score[i-1][j]

        if score_current == score_diagonal + match_score(seq1[i-1], seq2[j-1]):
            align1.append(seq1[i-1])
            align2.append(seq2[j-1])
            i -= 1
            j -= 1
        elif score_current == score_left + gap_penalty:
            align1.append(seq1[i-1])
            align2.append(gap)
            i -= 1
        elif score_current == score_up + gap_penalty:
            align1.append(gap)
            align2.append(seq2[j-1])
            j -= 1

    # Finish tracing up to the top left cell
    while i > 0:
        align1.append(seq1[i-1])
        align2.append(gap)
        i -= 1
    while j > 0:
        align1.append(gap)
        align2.append(seq2[j-1])
        j -= 1

    finalize(align1, align2)

def water(seq1, seq2):
    m, n = len(seq1), len(seq2)  # length of two sequences

    # Generate DP table and traceback path pointer matrix
    score = zeros((m+1, n+1))      # the DP table
    pointer = zeros((m+1, n+1))    # to store the traceback path

    max_score = 0        # initial maximum score in DP table
    # Calculate DP table and mark pointers
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score_diagonal = score[i-1][j-1] + match_score(seq1[i-1], seq2[j-1])
            score_up = score[i][j-1] + gap_penalty
            score_left = score[i-1][j] + gap_penalty
            score[i][j] = max(0,score_left, score_up, score_diagonal)
            if score[i][j] == 0:
                pointer[i][j] = 0 # 0 means end of the path
            if score[i][j] == score_left:
                pointer[i][j] = 1 # 1 means trace up
            if score[i][j] == score_up:
                pointer[i][j] = 2 # 2 means trace left
            if score[i][j] == score_diagonal:
                pointer[i][j] = 3 # 3 means trace diagonal
            if score[i][j] >= max_score:
                max_i = i
                max_j = j
                max_score = score[i][j];

    align1, align2 = [], []    # initial sequences

    i,j = max_i,max_j    # indices of path starting point

    #traceback, follow pointers
    while pointer[i][j] != 0:
        if pointer[i][j] == 3:
            align1.append(seq1[i-1])
            align2.append(seq2[j-1])
            i -= 1
            j -= 1
        elif pointer[i][j] == 2:
            align1.append(gap)
            align2.append(seq2[j-1])
            j -= 1
        elif pointer[i][j] == 1:
            align1.append(seq1[i-1])
            align2.append(gap)
            i -= 1

    finalize(align1, align2)

If we run this with the following input:

seq1 = ['[A0]', '[C0]', '[A1]', '[B1]']
seq2 = ['[A0]', '[A1]', '[B1]', '[C1]']

print "Needleman-Wunsch"
needle(seq1, seq2)
print
print "Smith-Waterman"
water(seq1, seq2)

We get this output:

Needleman-Wunsch
Identity = 60.000 percent
Score = 20
[A0][C0][A1][B1]----
[A0]----[A1][B1][C1]

Smith-Waterman
Identity = 75.000 percent
Score = 25
[A0][C0][A1][B1]
[A0]----[A1][B1]

For the specific changes I made, see: this GitHub repository.

BioGeek
  • 21,897
  • 23
  • 83
  • 145
1

Imagine we have a log file with alphabetic sequences. Like something you said, I converted sequences to A0A1... . For example, if there was a sequence like #read #write #add #write, it converted to A0A1A2A1. Every time, I read two character and compare them but keep score matrix like before. Here is my code in C# for smith-waterman string alignment.
Notice that Cell is a user defined class.

private void alignment()
     {
        string strSeq1;
        string strSeq2;

        string strTemp1;
        string strTemp2;

        scoreMatrix = new int[Log.Length, Log.Length];

        // Lists That Holds Alignments
        List<char> SeqAlign1 = new List<char>();
        List<char> SeqAlign2 = new List<char>();

       for (int i = 0; i<Log.Length; i++ )
        {
            for (int j=i+1 ; j<Log.Length; j++)
            {
                strSeq1 = "--" + logFile.Sequence(i);
                strSeq2 = "--" + logFile.Sequence(j);

                //prepare Matrix for Computing optimal alignment
                Cell[,] Matrix = DynamicProgramming.Intialization_Step(strSeq1, strSeq2, intSim, intNonsim, intGap);

                // Trace back matrix from end cell that contains max score 
                DynamicProgramming.Traceback_Step(Matrix, strSeq1, strSeq2, SeqAlign1, SeqAlign2);

                this.scoreMatrix[i, j] = DynamicProgramming.intMaxScore;

                strTemp1 = Reverse(string.Join("", SeqAlign1));
                strTemp2 = Reverse(string.Join("", SeqAlign2));

            }
        }
}

class DynamicProgramming
{
    public  static Cell[,] Intialization_Step(string Seq1, string Seq2,int Sim,int NonSimilar,int Gap)
    {
        int M = Seq1.Length / 2 ;//Length+1//-AAA    //Changed: /2
        int N = Seq2.Length / 2 ;//Length+1//-AAA

        Cell[,] Matrix = new Cell[N, M];

        //Intialize the first Row With Gap Penality Equal To Zero 
        for (int i = 0; i < Matrix.GetLength(1); i++)
        {
            Matrix[0, i] = new Cell(0, i, 0);

        }

        //Intialize the first Column With Gap Penality Equal To Zero 
        for (int i = 0; i < Matrix.GetLength(0); i++)
        {
            Matrix[i, 0] = new Cell(i, 0, 0);

        }

        // Fill Matrix with each cell has a value result from method Get_Max
        for (int j = 1; j < Matrix.GetLength(0); j++)
        {
            for (int i = 1; i < Matrix.GetLength(1); i++)
            {
                Matrix[j, i] = Get_Max(i, j, Seq1, Seq2, Matrix,Sim,NonSimilar,Gap);
            }
        }

        return Matrix;
    }

    public  static Cell Get_Max(int i, int j, string Seq1, string Seq2, Cell[,] Matrix,int Similar,int NonSimilar,int GapPenality)
    {
        Cell Temp = new Cell();
        int intDiagonal_score;
        int intUp_Score;
        int intLeft_Score;
        int Gap = GapPenality;

        //string temp1, temp2;
        //temp1 = Seq1[i*2].ToString() + Seq1[i*2 + 1]; temp2 = Seq2[j*2] + Seq2[j*2 + 1].ToString();

        if ((Seq1[i * 2] + Seq1[i * 2 + 1]) == (Seq2[j * 2] + Seq2[j * 2 + 1]))  //Changed: +
        {
            intDiagonal_score = Matrix[j - 1, i - 1].CellScore + Similar;
        }
        else
        {
            intDiagonal_score = Matrix[j - 1, i - 1].CellScore + NonSimilar;
        }

        //Calculate gap score
        intUp_Score = Matrix[j - 1, i].CellScore + GapPenality;
        intLeft_Score = Matrix[j, i - 1].CellScore + GapPenality;

        if (intDiagonal_score<=0 && intUp_Score<=0 && intLeft_Score <= 0)
        {
            return Temp = new Cell(j, i, 0);     
        }

        if (intDiagonal_score >= intUp_Score)
        {
            if (intDiagonal_score>= intLeft_Score)
            {
                Temp = new Cell(j, i, intDiagonal_score, Matrix[j - 1, i - 1], Cell.PrevcellType.Diagonal);
            }
            else
            {
                Temp = new Cell(j, i, intDiagonal_score, Matrix[j , i - 1], Cell.PrevcellType.Left);
            }
        }
        else
        {
            if (intUp_Score >= intLeft_Score)
            {
                Temp = new Cell(j, i, intDiagonal_score, Matrix[j - 1, i], Cell.PrevcellType.Above);
            }
            else
            {
                Temp = new Cell(j, i, intDiagonal_score, Matrix[j , i - 1], Cell.PrevcellType.Left);
            }
        }

        if (MaxScore.CellScore <= Temp.CellScore)
        {
            MaxScore = Temp;
        }

        return Temp;
    }

    public static void Traceback_Step(Cell[,] Matrix, string Sq1, string Sq2, List<char> Seq1, List<char> Seq2)
    {
        intMaxScore = MaxScore.CellScore;

        while (MaxScore.CellPointer != null)
        {
            if (MaxScore.Type == Cell.PrevcellType.Diagonal)
            {

                Seq1.Add(Sq1[MaxScore.CellColumn * 2 + 1]);  //Changed: All of the following lines with *2 and +1
                Seq1.Add(Sq1[MaxScore.CellColumn * 2]);
                Seq2.Add(Sq2[MaxScore.CellRow * 2 + 1]);
                Seq2.Add(Sq2[MaxScore.CellRow * 2]);

            }
            if (MaxScore.Type == Cell.PrevcellType.Left)
            {
                Seq1.Add(Sq1[MaxScore.CellColumn * 2 + 1]);
                Seq1.Add(Sq1[MaxScore.CellColumn * 2]);
                Seq2.Add('-');

            }
            if (MaxScore.Type == Cell.PrevcellType.Above)
            {
                Seq1.Add('-');
                Seq2.Add(Sq2[MaxScore.CellRow * 2 + 1]);
                Seq2.Add(Sq2[MaxScore.CellRow * 2]);

            }

            MaxScore = MaxScore.CellPointer;

        }          

    }
}
mostafa8026
  • 273
  • 2
  • 12
  • 25
RaSha
  • 1,356
  • 1
  • 16
  • 30