
I have a matrix M1, each row of which is a time-dependent signal.

And I have another matrix, M2, of the same dimensions, each row of which is also a time-dependent signal, used as a "template" to recognize signal shapes in the first matrix.

As a result I want a column vector v, where v[i] is the correlation between the i'th row of M1 and the i'th row of M2. (In the example below, each row of M2 is 10 times the corresponding row of M1, so I'd expect v = [1, 1].)

I've looked into the corrcoef function of numpy and tried the following code:

import numpy as np

M1 = np.array ([
    [1, 2, 3, 4],
    [2, 3, 1, 4]
])

M2 = np.array ([
    [10, 20, 30, 40],
    [20, 30, 10, 40]
])

print (np.corrcoef (M1, M2))

which prints:

[[ 1.   0.4  1.   0.4]
 [ 0.4  1.   0.4  1. ]
 [ 1.   0.4  1.   0.4]
 [ 0.4  1.   0.4  1. ]]

I've been reading the docs, but I am still confused as to which entries of this matrix I have to pick as entries of my vector v.

Can anyone help?

(I've studied several S.O. answers to similar questions, but haven't yet seen the light...)

Code context:

There are 256 rows (signals), and I run a sliding window of 200 samples over the 'main signal', which has a total length of 10k samples. So M1 and M2 are both 256 rows x 200 columns; the 10k samples is the total signal length, not the window size. By using correlation with a sliding template I try to find the offsets where the template matches best. Actually I am looking for QRS complexes in a 256-channel invasive cardiogram (or rather, electrogram, as physicians call it).

    lg.info ('Processor: {}, time: {}, markers: {}'.format (self.key, dt.datetime.now ().time (), len (self.data.markers)))

    # Compute average signal shape over preexisting markers and use that as a template to find the others.
    # All generated markers will have the width of the widest preexisting one.

    template = np.zeros ((self.data.samples.shape [0], self.bufferWidthSteps))

    # Add intervals that were marked in advance
    nrOfTerms = 0
    maxWidthSteps = 0
    newMarkers = []
    for marker in self.data.markers:
        if marker.key == self.markerKey:

            # Find start and stop sample index    
            startIndex = marker.tSteps - marker.stampWidthSteps // 2
            stopIndex = marker.tSteps + marker.stampWidthSteps // 2

            # Extract relevant slice from samples and add it to template
            template += np.hstack ((self.data.samples [ : , startIndex : stopIndex], np.zeros ((self.data.samples.shape [0], self.bufferWidthSteps - marker.stampWidthSteps))))

            # Adapt nr of added terms to facilitate averaging
            nrOfTerms += 1

            # Remember maximum width of previously marked QRS complexes
            maxWidthSteps = max (maxWidthSteps, marker.stampWidthSteps)
        else:
            # Preexisting markers with non-matching keys are just copied to the new marker list
            # Preexisting markers with a matching key are omitted from the new marker list
            newMarkers.append (marker)

    # Compute average of intervals that were marked in advance
    template = template [ : , 0 : maxWidthSteps] / nrOfTerms
    halfWidthSteps = maxWidthSteps // 2

    # Append markers of intervals that yield an above threshold correlation with the averaged marked intervals
    firstIndex = 0
    stopIndex = self.data.samples.shape [1] - maxWidthSteps
    while firstIndex < stopIndex:
        # Correlation matrix of all template rows and all rows of the current window
        corr = np.corrcoef (
            template,
            self.data.samples [ : , firstIndex : firstIndex + maxWidthSteps]
        )

        # The offset diagonal holds the row-wise correlations:
        # template row i versus window row i
        diag = np.diagonal (
            corr,
            template.shape [0]
        )

        # Average correlation over all channels
        meanCorr = np.mean (diag)

        if meanCorr > self.correlationThreshold:
            newMarkers.append (self.markerFactories [self.markerKey].make (firstIndex + halfWidthSteps, maxWidthSteps))

            # Prevent overlapping markers
            firstIndex += maxWidthSteps
        else:
            firstIndex += 5

    self.data.markers = newMarkers

    lg.info ('Processor: {}, time: {}, markers: {}'.format (self.key, dt.datetime.now ().time (), len (self.data.markers)))
– Jacques de Hooge

3 Answers


Based on this solution for finding the correlation matrix between two 2D arrays, we can have a similar one for finding a correlation vector that computes the correlation between corresponding rows in the two arrays. The implementation would look something like this -

def corr2_coeff_rowwise(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]

    # Sum of squares across rows
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)

    # Finally get corr coeff
    return np.einsum('ij,ij->i', A_mA, B_mB) / np.sqrt(ssA * ssB)
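
In other words, for each row pair this evaluates the Pearson coefficient

    r[i] = sum_j(A_mA[i,j] * B_mB[i,j]) / sqrt(ssA[i] * ssB[i])

which is the same quantity np.corrcoef computes per pair of rows, just without building the full correlation matrix of all stacked rows.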

We can further optimize the computation of ssA and ssB by introducing einsum magic there too: np.einsum('ij,ij->i', X, X) produces the row-wise sum of squares in a single pass, without materializing the squared intermediate array!

def corr2_coeff_rowwise2(A, B):
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = np.einsum('ij,ij->i', B_mB, B_mB)
    return np.einsum('ij,ij->i', A_mA, B_mB) / np.sqrt(ssA * ssB)

Sample run -

In [164]: M1 = np.array ([
     ...:     [1, 2, 3, 4],
     ...:     [2, 3, 1, 4.5]
     ...: ])
     ...: 
     ...: M2 = np.array ([
     ...:     [10, 20, 33, 40],
     ...:     [20, 35, 15, 40]
     ...: ])
     ...: 

In [165]: corr2_coeff_rowwise(M1, M2)
Out[165]: array([ 0.99411402,  0.96131896])

In [166]: corr2_coeff_rowwise2(M1, M2)
Out[166]: array([ 0.99411402,  0.96131896])

Runtime test -

In [97]: M1 = np.random.rand(256,200)
    ...: M2 = np.random.rand(256,200)
    ...: 

In [98]: out1 = np.diagonal (np.corrcoef (M1, M2), M1.shape [0])
    ...: out2 = corr2_coeff_rowwise(M1, M2)
    ...: out3 = corr2_coeff_rowwise2(M1, M2)
    ...: 

In [99]: np.allclose(out1, out2)
Out[99]: True

In [100]: np.allclose(out1, out3)
Out[100]: True

In [101]: %timeit np.diagonal (np.corrcoef (M1, M2), M1.shape [0])
     ...: %timeit corr2_coeff_rowwise(M1, M2)
     ...: %timeit corr2_coeff_rowwise2(M1, M2)
     ...: 
100 loops, best of 3: 9.5 ms per loop
1000 loops, best of 3: 554 µs per loop
1000 loops, best of 3: 430 µs per loop

20x+ speedup there with einsum over the built-in np.corrcoef! That is expected: np.corrcoef (M1, M2) builds the full 512 x 512 correlation matrix of all stacked rows, while the row-wise versions compute only the 256 values actually needed.

– Divakar
  • I've tried this solution with signal vectors of 10k samples, but unfortunately there it turns out to be considerably slower. Still thanks for this creative solution. – Jacques de Hooge Jan 18 '17 at 09:47
  • @JacquesdeHooge What are the shapes of `M1` and `M2` in that case? – Divakar Jan 18 '17 at 09:49
  • There are 256 rows (signals), and I run a sliding window of 200 samples over the 'main signal'. So actually not 10k samples, but 200 sample vectors of 256 components each. So M1 and M2 are both 256 rows x 200 columns. Sorry for the erroneous 10k samples. That's the total signal length. By using correlation with a sliding template I try to find the offsets where the template matches best. Actually I am looking for QRS complexes in a 256 channel invasive cardiogram (or rather, electrogram, as physicians call it). I'll attempt to add some of the original code context. – Jacques de Hooge Jan 18 '17 at 10:13
  • @JacquesdeHooge Check out the edited `corr2_coeff_rowwise2`? – Divakar Jan 18 '17 at 10:26
  • @JacquesdeHooge So did you check that out? – Divakar Jan 20 '17 at 19:17

I think it's this: (please correct if wrong!)

import numpy as np

M1 = np.array ([
    [1, 2, 3, 4],
    [2, 3, 1, 4.5]
])

M2 = np.array ([
    [10, 20, 33, 40],
    [20, 35, 15, 40]
])

v = np.diagonal (np.corrcoef (M1, M2), M1.shape [0])

print (v)

Which prints:

[ 0.99411402  0.96131896]
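
As far as I understand it, this works because np.corrcoef (M1, M2) stacks the rows of both matrices and treats each row as a variable, yielding a (2n) x (2n) matrix whose upper-right n x n block contains the correlations between rows of M1 and rows of M2. The diagonal of that block is the wanted vector, and the offset argument of np.diagonal reads exactly that (continuing the snippet above):

n = M1.shape [0]
C = np.corrcoef (M1, M2)    # (2n, 2n) block matrix: [[M1 vs M1, M1 vs M2], [M2 vs M1, M2 vs M2]]
cross = C [ : n, n : ]      # upper-right block: rows of M1 versus rows of M2
v = np.diagonal (cross)     # same values as np.diagonal (C, n)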

Since the result has only one dimension, I can think of it as a column vector...
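
If an explicit (n, 1) column shape is needed, a minimal reshape does it:

v_col = v.reshape (-1, 1)    # equivalently: v [ : , None]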

– Jacques de Hooge

Not knowing enough of numpy's array magic, I'd just pick out the rows and feed each pair individually to corrcoef:

# np.corrcoef on two 1-D arrays returns a 2x2 matrix; [0][1] is the cross-correlation term
[np.corrcoef(i,j)[0][1] for i,j in zip(a,b)]

For a column-shaped np.array output:

c, c.shape = np.array([np.corrcoef(i,j)[0][1] for i,j in zip(a,b)]), (a.shape[0], 1)
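
The tuple assignment above first binds c to the flat array and then sets its shape in place; an equivalent spelling with reshape (just an alternative, same result):

c = np.array([np.corrcoef(i,j)[0][1] for i,j in zip(a,b)]).reshape(-1, 1)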

I'm sure there's something better using numpy broadcast/indexing features.

– f5r5e5d
  • I've thought about that, and it would be acceptable, but I'd rather not loop in Python for performance reasons. I think, however, the performance drop would be minimal, since most of the time would be spent in corrcoef itself. – Jacques de Hooge Jan 18 '17 at 09:03