Looking for occurrences of a pattern on each row of a matrix, I found that there was not clear solution to do it on python for very big matrix having a good performance.
I have a matrix similar to
matrix = np.array([[0,1,1,0,1,0],
[0,1,1,0,1,0]])
print 'matrix: ', matrix
where I want to check the occurreces of patterns [0,0], [0,1] [1,0] and [1,1] on each rowconsidering overlapping. For the example given, where both rows are equal,ther result is equal for each pattern:
- pattern[0,0] = [0,0]
- pattern[0,1] = [2,2]
- pattern[1,0] = [2,2]
- pattern[1,1] = [1,1]
The matrix in this example is quite small, but I am looking for performance as I have a huge matrix. You can test matrix with matrix = numpy.random.randint(2, size=(100000,10))
or bigger for example to see the differences
First I though on a possible answer converting rows to strings and looking for occurrences based on this answer (string count with overlapping occurrences):
def string_occurrences(matrix):
print '\n===== String count with overlapping ====='
numRow,numCol = np.shape(matrix)
Ocur = np.zeros((numRow,4))
for i in range(numRow):
strList = ''.join(map(str,matrix[i,:]))
Ocur[i,0] = occurrences(strList,'00')
Ocur[i,1] = occurrences(strList,'01')
Ocur[i,2] = occurrences(strList,'10')
Ocur[i,3] = occurrences(strList,'11')
return Ocur
using the function occurrences
of the answer
def occurrences(string, sub):
count = start = 0
while True:
start = string.find(sub, start) + 1
if start > 0:
count+=1
else:
return count
but considering that the real array is huge, this solution is very very slow as it uses for loops, strings,...
So looking for a numpy solution I used a trick to compare the values with a pattern and roll the matrix on axis=1
to check all the occurrences.
I call it pseudo rolling window on 2D as the window is not square and the way of calculation is different. There are 2 options, where the second (Option 2) is faster because it avoids the extra calculation of numpy.roll
def pseudo_rolling_window_Opt12(matrix):
print '\n===== pseudo_rolling_window ====='
numRow,numCol = np.shape(matrix)
Ocur = np.zeros((numRow,4))
index = 0
for i in np.arange(2):
for j in np.arange(2):
#pattern = -9*np.ones(numCol) # Option 1
pattern = -9*np.ones(numCol+1) # Option 2
pattern[0] = i
pattern[1] = j
for idCol in range(numCol-1):
#Ocur[:,index] += np.sum(np.roll(matrix,-idCol, axis=1) == pattern, axis=1) == 2 # Option 1: 219.398691893 seconds (for my real matrix)
Ocur[:,index] += np.sum(matrix[:,idCol:] == pattern[:-(idCol+1)], axis=1) == 2 # Option 2: 80.929688930 seconds (for my real matrix)
index += 1
return Ocur
Searching for other possibilities, I found the "rolling window" which seemed to be a god answer for performance as it used the numpy function. Looking to this answer (Rolling window for 1D arrays in Numpy?) and the links on it, I checked the following function. But really, I do not understand the output as it seems that the calculations of the window are matching what I was expecting for result.
def rolling_window(a, size):
shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
Used as:
a = rolling_window(matrix, 2)
print a == np.array([0,1])
print np.all(rolling_window(matrix, 2) == [0,1], axis=1)
Does someone know what is wrong on this last case? Or could be any possibility with better performance?