
I'm working with DNA sequence alignments and trying to implement a simple scoring algorithm. Since I have to use a matrix for the calculations, I thought NumPy should be much faster than a list of lists, but when I tested both, the Python lists turned out to be much faster. I found this thread (Why use numpy over list based on speed?) but still: I'm comparing a preallocated NumPy array against preallocated lists, and the list of lists is the clear winner. Here is my code:

Lists

def edirDistance(x, y):
    x_dim = len(x) + 1
    y_dim = len(y) + 1
    # Preallocate the x_dim-by-y_dim matrix as a list of lists
    D = []
    for i in range(x_dim):
        D.append([0] * y_dim)

    #Filling the matrix borders
    for i in range(x_dim):
        D[i][0] = i
    for i in range(y_dim):
        D[0][i] = i

    # Dynamic-programming fill: each cell is the cheapest of an
    # insertion, a deletion, or a (mis)match along the diagonal
    for i in range(1, x_dim):
        for j in range(1, y_dim):
            distHor = D[i][j-1] + 1
            distVer = D[i-1][j] + 1
            if x[i-1] == y[j-1]:
                distDiag = D[i-1][j-1]
            else:
                distDiag = D[i-1][j-1] + 1
            D[i][j] = min(distHor, distVer, distDiag)
    return D

NumPy

import numpy as np

def NP_edirDistance(x, y):
    x_dim = len(x) + 1
    y_dim = len(y) + 1
    D = np.zeros((x_dim, y_dim))

    #Filling the matrix borders
    for i in range(x_dim):
        D[i][0] = i
    for i in range(y_dim):
        D[0][i] = i

    # Same dynamic-programming fill as the list version
    for i in range(1, x_dim):
        for j in range(1, y_dim):
            distHor = D[i][j-1] + 1
            distVer = D[i-1][j] + 1
            if x[i-1] == y[j-1]:
                distDiag = D[i-1][j-1]
            else:
                distDiag = D[i-1][j-1] + 1
            D[i][j] = min(distHor, distVer, distDiag)
    return D

I'm not timing the np import.

a = 'ACGTACGACTATCGACTAGCTACGAA'
b = 'ACCCACGTATAACGACTAGCTAGGGA'
%%time
edirDistance(a, b)

total: 1.41 ms

%%time
NP_edirDistance(a, b)

total: 4.43 ms

Replacing D[i][j] with D[i, j] greatly improved the time, but NumPy is still slower. (Thanks @Learning is a mess!)

total: 2.64 ms

I tested with even larger DNA sequences (around 10,000 letters each) and the lists still win.

Can someone help me improve timing?

Are lists better for this use?

  • `numpy` arrays are fast - if you use the whole-array compiled methods. When used iteratively like this, lists are faster. Arrays are not a drop-in replacement; you have to take time to learn how to use them effectively. Otherwise stick with the lists. – hpaulj Nov 29 '21 at 16:54
  • As said by @hpaulj you are not using any vectorization here. Also try replacing each `D[i][j]` by `D[i,j]` for quicker look-ups and assigns. – Learning is a mess Nov 29 '21 at 17:01
  • Plus the border filling can be done with `D[:, 0] = np.arange(x_dim); D[0, :] = np.arange(y_dim)`, which is vectorized and faster than your two for loops (see the sketch below these comments). – Learning is a mess Nov 29 '21 at 17:14
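
Putting the two comments together, a sketch of what the revised NumPy version could look like (illustrative only; the name NP_edirDistance_v2 is made up here, and the DP recurrence is exactly the one from the question):

    import numpy as np

    def NP_edirDistance_v2(x, y):  # hypothetical name for the revised version
        x_dim = len(x) + 1
        y_dim = len(y) + 1
        D = np.zeros((x_dim, y_dim), dtype=int)

        # Vectorized border fill instead of the two Python loops
        D[:, 0] = np.arange(x_dim)
        D[0, :] = np.arange(y_dim)

        for i in range(1, x_dim):
            for j in range(1, y_dim):
                # Tuple indexing D[i, j] is a single lookup;
                # D[i][j] first materializes a row view, then indexes it
                cost = 0 if x[i - 1] == y[j - 1] else 1
                D[i, j] = min(D[i, j - 1] + 1, D[i - 1, j] + 1, D[i - 1, j - 1] + cost)
        return D

Note this still visits every cell in Python, so it only trims constant factors: since D[i, j] depends on D[i, j-1] in the same row, the inner loop cannot be fully vectorized as-is.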

1 Answer


One way to get a faster run is to use an accelerator such as numba and …. I tested your code with the `a` and `b` above on a Google Colab TPU runtime, first without any accelerator:

1000 loops, best of 5: 563 µs per loop
1000 loops, best of 5: 1.95 ms per loop    # NumPy

But applying numba with nopython=True, without any other changes to your code:

import numba as nb

@nb.njit()
def edirDistance(x, y):
    ...  # same body as in the question

@nb.njit()
def NP_edirDistance(x, y):
    ...  # same body as in the question

This gives:

1000 loops, best of 5: 213 µs per loop
1000 loops, best of 5: 153 µs per loop    # NumPy
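
One caveat worth adding (not in the original answer): Numba compiles a function the first time it is called, so a warm-up call should be excluded from the timing, e.g.:

    edirDistance(a, b)           # first call triggers JIT compilation
    %timeit edirDistance(a, b)   # later calls measure the compiled code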

The difference between them grows significant with huge samples, or by further improving and vectorizing your NumPy code. For samples of length 10,000, this method gives:

35.50053691864014    # lists (seconds)
22.95994758605957    # NumPy (seconds)
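
Beyond the decorator, one common refinement when pushing Numba further (a sketch, not part of the original answer; the byte-array conversion and the function name are illustrative) is to pass the sequences as integer arrays and use an integer dtype for the matrix, since numeric element comparisons typically compile to tighter code than string indexing:

    import numpy as np
    import numba as nb

    @nb.njit(cache=True)
    def nb_edit_distance(x, y):
        # x and y are 1-D uint8 arrays holding the byte codes of the DNA letters
        x_dim = x.shape[0] + 1
        y_dim = y.shape[0] + 1
        D = np.zeros((x_dim, y_dim), dtype=np.int32)
        for i in range(x_dim):
            D[i, 0] = i
        for j in range(y_dim):
            D[0, j] = j
        for i in range(1, x_dim):
            for j in range(1, y_dim):
                cost = 0 if x[i - 1] == y[j - 1] else 1
                D[i, j] = min(D[i, j - 1] + 1, D[i - 1, j] + 1, D[i - 1, j - 1] + cost)
        return D

    a = np.frombuffer(b'ACGTACGACTATCGACTAGCTACGAA', dtype=np.uint8)
    b = np.frombuffer(b'ACCCACGTATAACGACTAGCTAGGGA', dtype=np.uint8)
    print(nb_edit_distance(a, b)[-1, -1])   # edit distance between the two sequences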
– Ali_Sh