2

Looking for occurrences of a pattern on each row of a matrix, I found that there was not clear solution to do it on python for very big matrix having a good performance.

I have a matrix similar to

matrix = np.array([[0,1,1,0,1,0],
                         [0,1,1,0,1,0]])
print 'matrix: ', matrix

where I want to check the occurreces of patterns [0,0], [0,1] [1,0] and [1,1] on each rowconsidering overlapping. For the example given, where both rows are equal,ther result is equal for each pattern:

  • pattern[0,0] = [0,0]
  • pattern[0,1] = [2,2]
  • pattern[1,0] = [2,2]
  • pattern[1,1] = [1,1]

The matrix in this example is quite small, but I am looking for performance as I have a huge matrix. You can test matrix with matrix = numpy.random.randint(2, size=(100000,10)) or bigger for example to see the differences

First I though on a possible answer converting rows to strings and looking for occurrences based on this answer (string count with overlapping occurrences):

def string_occurrences(matrix):
    print '\n===== String count with overlapping ====='
    numRow,numCol = np.shape(matrix)
    Ocur = np.zeros((numRow,4))
    for i in range(numRow):
        strList = ''.join(map(str,matrix[i,:]))
        Ocur[i,0] = occurrences(strList,'00')
        Ocur[i,1] = occurrences(strList,'01')
        Ocur[i,2] = occurrences(strList,'10')
        Ocur[i,3] = occurrences(strList,'11')
    return Ocur

using the function occurrences of the answer

def occurrences(string, sub):
    count = start = 0
    while True:
        start = string.find(sub, start) + 1
        if start > 0:
            count+=1
        else:
            return count

but considering that the real array is huge, this solution is very very slow as it uses for loops, strings,... So looking for a numpy solution I used a trick to compare the values with a pattern and roll the matrix on axis=1 to check all the occurrences. I call it pseudo rolling window on 2D as the window is not square and the way of calculation is different. There are 2 options, where the second (Option 2) is faster because it avoids the extra calculation of numpy.roll

def pseudo_rolling_window_Opt12(matrix):
    print '\n===== pseudo_rolling_window ====='
    numRow,numCol = np.shape(matrix)
    Ocur = np.zeros((numRow,4))
    index = 0
    for i in np.arange(2):
        for j in np.arange(2):
            #pattern = -9*np.ones(numCol)   # Option 1
            pattern = -9*np.ones(numCol+1)  # Option 2
            pattern[0] = i
            pattern[1] = j
            for idCol in range(numCol-1):
                #Ocur[:,index] += np.sum(np.roll(matrix,-idCol, axis=1) == pattern, axis=1) == 2    # Option 1: 219.398691893 seconds (for my real matrix)
                Ocur[:,index] += np.sum(matrix[:,idCol:] == pattern[:-(idCol+1)], axis=1) == 2      # Option 2:  80.929688930 seconds (for my real matrix)
            index += 1
    return Ocur

Searching for other possibilities, I found the "rolling window" which seemed to be a god answer for performance as it used the numpy function. Looking to this answer (Rolling window for 1D arrays in Numpy?) and the links on it, I checked the following function. But really, I do not understand the output as it seems that the calculations of the window are matching what I was expecting for result.

def rolling_window(a, size):
    shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

Used as:

a = rolling_window(matrix, 2)
print a == np.array([0,1])
print np.all(rolling_window(matrix, 2) == [0,1], axis=1)

Does someone know what is wrong on this last case? Or could be any possibility with better performance?

Community
  • 1
  • 1
iblasi
  • 1,269
  • 9
  • 21

1 Answers1

1

You are using the wrong axis of the numpy array. You should change the axis in np.all from 1 to 2. Using the following code:

a = rolling_window(matrix, 2)
print np.all(rolling_window(matrix, 2) == [0,1], axis=2)

you get:

>>>[[ True False False  True False]
    [ True False False  True False]]

So, in order to get the results you are looking for:

print np.sum(np.all(rolling_window(matrix, 2) == [0,1], axis=2),axis=1)

>>>[2 2]
charles
  • 713
  • 1
  • 6
  • 16
  • 1
    Totally true... After looking for the numpy shape of the output that was (x,y,z) I didn't realize that axis was set to 1. Thank you for that. Just for curiosity for other users, the rolling_window option reduces 50% the execution time of the pseudo rolling window that I was using, so perfect choice for a good performance!!! – iblasi Jan 27 '16 at 09:28