How can I best optimize a matrix multiply in IDL that uses a diagonally-oriented (not diagonal) matrix

Question

I am looking for most efficient IDL code to replace the IDL matrix multiply (#) operator for a specific, diagonally-oriented (not diagonal, or diagonally-symmetric) matrix with 3 distinct values: unity on the diagonal; unity plus a delta to the right of the diagonal; unity minus the same delta to the left.

Problem domain

IDL (fixed; non-negotiable; sorry); image smear on shutter-less CCD imaging system.

Basic problem statement

Given

a 1024x1024 matrix, "EMatrix," with unity on the diagonal; (1-delta) to the left of the diagonal; (1+delta) to the right; delta = 0.044.
another 1024x1024 matrix, Image

Query

what is the fastest IDL code to calculate (Image # EMatrix)?

2014-09-16: See update below

Background

Larger problem statement (of which the matrix multiply seems to be only the slowest part, and optimizing the whole routine would not hurt):

http://pdssbn.astro.umd.edu/holdings/nh-j-lorri-3-jupiter-v1.1/document/soc_inst_icd.pdf

Section 9.3.1.2 (PDF page 47; internal page 34), and other documents in that same directory (sorry as a newbie I can only post two links)

My work so far

https://github.com/drbitboy/OptimizeDiagishIDLMatrixMultiply

That is now (2014-09-26) about an order of magnitude faster than the IDL # operator for a 1024x1024 matrix.

Details

The naive operation is O(n^3) and performs about a billion (2^30) double-precision multiplies and about the same number of additions; Wikipedia further tells me Strassen's algorithm cuts that down to O(n^2.807), or ~282M+ multiplies for n=1024.

Breaking it down for a simple 3x3 case, say image and EMatrix are

   image        EMatrix

[ 0  1  2 ]   [ 1  p  p ]
[ 3  4  5 ] # [ m  1  p ]
[ 6  7  8 ]   [ m  m  1 ]

where p represents 1+delta (1.044) and m represents 1-delta (0.956).

Because of the repetition of m's and p's, there should be a simplification available: looking at the middle column for image, the result for the three rows should be

[1,4,7] . [1,p,p] = m*(0)   + 1*1 + p*(4+7)
[1,4,7] . [m,1,p] = m*(1)   + 1*4 + p*(7)
[1,4,7] . [m,m,1] = m*(1+4) + 1*7 + p*(0)

where . represents the dot (inner?) product i.e. [1,4,7].[a,b,c] = (1a + 4b + 7c)

Based on that, here's what I did so far:

The middle term is just the column itself, and the sums multiplied by m and p look a lot like cumulative sums of contiguous sections of the column, possibly reversed (for m), shifted one and with the first element set to zero.

i.e. for m:

; shift image down one:
imgminusShift01 = shift(image,0,1)

; zero the top row:
imgminusShift01[*,0] = 0

; Make a cumulative sum:
imgminusCum = total( imageShift01, 2, /cumulative)

For p, imgplusShift01Cum follows essentially the same path but with a rotate(...,7) before and after to flip things up and down.

Once we have those three matrices (the original image, with its offspring imgminusShift01Cum and imgplusShift01Cum), the desired result is

m * imgminusShift01Cum    +   1 * image   +   p * imgplusShift01Cum

To do the shifts and rotates, I use indices, shifted and rotated themselves and stored in a common block for subsequent calls which saves another 10-20%.

Summary, so far

2014-09-16: See update below

And that gives a speedup of 5+.

I was expecting a bit more, because I think I am down to 3M multiplies and 2M additions, so maybe the memory allocation is the expensive part and I should be doing something like REPLICATE_INPLACE (my IDL is rusty - I have not done much since 6.4 and early 7.0).

Or maybe IDL's matrix multiplication is better than theory?

Alternate approaches and other thoughts:

Can we do something with the fact that the EMatrix is equal to unity plus a matrix with zeros on the diagonal and +/- delta in theupper and lower triangles?
By summing over columns I am accessing the data sequentially the "wrong" way, but would it actually save time to transpose Image first (it's only 8MB)?
Obviously choosing another language, getting a GPU array to help, or writing a DLM, would be other options, but let's keep this to IDL for now.

advTHANKSance (yes I am that old;-),

Brian Carcich 2014-07-20

Update 2014-09-16

Diego's approach brought me to a much simpler solution:

I think we got it; my first pass was too complicated, with all the rotations.

To use Diego's notation, but transposed, I am looking for

[K] = [IMG] # [E]

Since that multiplies columns of [IMG] by rows of [E], there is no interaction between columns of [IMG], so for analysis we only need to look at one column of [IMG], the dot (inner) products of which, with the rows of [E], become one column of the result [K]. Expanding that idea to one column of a 3x3 matrix with elements x, y and z:

[Kx]   [x]   [1    1+d  1+d]
[Ky] = [y] # [1-d  1    1+d]
[Kz] = [z]   [1-d  1-d  1  ]

Looking specifically at element Ky above (one element of [K], corresponding to y in [IMG], broken down to a formula using only scalars):

Ky = x * (1-d)  +  y * 1  + z * (1+d)

Generalizing for any y in a column of any length:

Ky = sumx * (1-d)  +  y * 1  + sumz * (1+d)

Where scalars sumx and sumz are sums of all values above and below, respectively, y in that column of [IMG]. N.B. sumx and sumz are values specific to element y.

Rearranging:

Ky = (sumx + y + sumz) - d * (sumx - sumz)

Ky = tot - d * (sumx - sumz)

where

tot = (sumx + y + sumz)

i.e. tot is the sum of all values in the column (e.g. in IDL: tot = total(IMG,2)).

So to this point I have basically duplicated Diego's work; the rest of this analysis converts that last equation for Ky into a form suitable for speedy evaluation in IDL.

Solving the tot equation for sumz:

sumz = tot - (y + sumx)

Substituting back into Ky:

Ky = tot - (sumx - (tot - (y + sumx)))

Ky = tot - ((2 * sumx) + y - tot)

Ky = tot + (tot - ((2 * sumx) + y)

Using sumxy to represent the sum of all values in the column from the top down to, and including, y (IDL: [SUMXY] = total([IMG],2,/CUMULATIVE))

sumxy = sumx + y

and

sumx = sumxy - y

Substituting back into Ky:

Ky = tot + (tot - ((2 * (sumxy - y)) + y)

Ky = tot + (tot + y - (2 * sumxy))

So if we can evaluate tot and sumxy for every element of [IMG], i.e. if we can evaluate the matrices [TOT] and [SUMXY], and we already have [IMG] as the matrix version of y, then it is a simple linear combination of those matrices.

In IDL, these are simply:

[SUMXY] = TOTAL([IMG],2,/CUMULATIVE)

[TOT] = [SUMXY][*,N-1] # REPLICATE(1D0,1,N)

I.e. [TOT] is the last row of [SUMXY], duplicated to form a matrix of N rows.

And the final code looks like this:

function lorri_ematrix_multiply,NxN,EMatrix

  NROWS = (size(NxN,/DIM))[1]

  SUMXY = TOTAL(NxN,2,/CUMULATIVE)

  TOT   = SUMXY[*,NROWS-1] # REPLICATE(1,NROWS,1d0)

  RETURN, TOT + ((EMatrix[1,0] - 1d0) * (TOT + NxN - (2d0 * SUMXY)))

end

which on our system is just shy of an order of magnitude faster than [IMG] # [E].

N.B. delta = (EMatrix[1,0] - 1d0)

Woo hoo!

I just ran the code on GDL and got a speedup of ~170. Maybe GDL is worse than naive? — Brian Carcich, Jul 20 '14 at 20:23

Diego Mazzaro · Accepted Answer · 2014-07-23T22:41:39.903

Step 1:

I go straight on with math notation since I think that could be more clear than explaining by words:

      [ +1 +d +d +d +d ]   [ 1 0 0 0 0 ]       [ 0 1 1 1 1 ]        [ 0 0 0 0 0 ]
      [ -d +1 +d +d +d ]   [ 0 1 0 0 0 ]       [ 0 0 1 1 1 ]        [ 1 0 0 0 0 ]
[E] = [ -d -d +1 +d +d ] = [ 0 0 1 0 0 ] + d * [ 0 0 0 1 1 ] - d *  [ 1 1 0 0 0 ]
      [ -d -d -d +1 +d ] | [ 0 0 0 1 0 ]       [ 0 0 0 0 1 ]        [ 1 1 1 0 0 ]
      [ -d -d -d -d +1 ] | [ 0 0 0 0 1 ]       [ 0 0 0 0 0 ]        [ 1 1 1 1 0 ]
                         |
                         | [ 1 0 0 0 0 ]       [ 1 1 1 1 1 ]        [ 1 0 0 0 0 ]
                         | [ 0 1 0 0 0 ]       [ 0 1 1 1 1 ]        [ 1 1 0 0 0 ]
                         = [ 0 0 1 0 0 ] + d * [ 0 0 1 1 1 ] - d *  [ 1 1 1 0 0 ]
                         | [ 0 0 0 1 0 ]       [ 0 0 0 1 1 ]        [ 1 1 1 1 0 ]
                         | [ 0 0 0 0 1 ]       [ 0 0 0 0 1 ]        [ 1 1 1 1 1 ]
                         |     [ID]                [UT]                 [LT]
                         |
                         = [ID] + d * [UT] - d * [LT]

==>

[Img] # [E] = [E]##[Img] = [Img] + d * [UT] ## [Img] - d * [LT] ## [Img]

Now let's observe what is [LT] ## [Img]:

row 1 is the same as first line of [Img]
row 2 is the (columnwise) sum of row 1 and 2 of [Img]
row i is the (columnwise) sum of all the first i rows of [Img]
row n is the (columnwise) sum of all the rows of [Img]

so an efficient way to compute it is:

TOTAL(Image, 2, /CUMULATIVE)

Analogous but a bit different is the result of [UT] ## [Img]:

row 1 is the (columnwise) sum of all the rows of [Img]
row i is the (columnwise) sum of all the last i rows of [Img]
row n-1 is the (columnwise) sum of row 1 and 2 of [Img]
row n is the same as first line of [Img]

so [UT] ## [Img] = REVERSE(TOTAL(REVERSE(Image,2), 2, /CUMULATIVE),2)

Then:

[Img] # [E] = [E]##[Img] = Image + d * (REVERSE(TOTAL(REVERSE(Image,2), 2, /CUMULATIVE),2) - TOTAL(Image, 2, /CUMULATIVE))

Note that we see that in the end the results on each column are coming only from data of that same column.

Step 2:

Now let's look to [K] = [UT] ## [Img] - [LT] ## [Img] and see how it looks like. If for each generic column we name the column elements r(1), r(2), r(3), .... ,r(i), ..., r(n) we can see that the corresponding [K] column elements R(i) looks like this:

when n is even (that is the case of 1024)

row 1     => R(1) = +r(1)                 -r(1) -r(2) .... -r(n-1) -r(n) = -SUM(j=2, n, r(n))
row 2     => R(1) = +r(1) +r(2)           -r(1) -r(2) .... -r(n-1)       = -SUM(j=3, n-1, r(n))
row 3     => R(1) = +r(1) +r(2) +r(3)     -r(1) -r(2) .... -r(n-2)       = -SUM(j=4, n-2, r(n))
    :        :       :     :     :    :    :     :     :    :               :
row i (i < n/2) 
          => R(1) = +r(1) ... +r(i)       -r(1) -r(2) .... -r(n-i+1)     = -SUM(j=i+1, n-i+1, r(n))
    :        :       :     :     :    :    :     :     :    :               :
row n/2   => R(1) = +r(1) ... +r(n/2)     -r(1) -r(2) .... -r(n/2+1)     = -r(n/2+1) 
row n/2+1 => R(1) = +r(1) ... +r(n/2+1)   -r(1) -r(2) .... -r(n/2)       = +r(n/2+1) 
    :        :       :     :     :    :    :     :     :    :               :
row i (i > n/2)
          => R(1) = +r(1) ... + r(i)      -r(1) -r(2) .... -r(n-i+1)     = +SUM(j=n-i+2, i, r(n))
                                                                         = -R(n-i+1)
    :        :       :     :     :    :    :     :     :    :               :
row n     => R(1) = +r(1) ... + r(n)      -r(1)                          = +SUM(j=2, n, r(n))
                                                                         = -R(1)

when n is odd It is similar but R((n+1)/2) will be all 0. I will not go in detail with this.

What is important is that the matrix [K] = [UT] ## [Img] - [LT] ## [Img] is anti-symmetrical with respect to its horizontal halving line.

This means that we could calculate values only in half matrix (let's say the top part) and then populate the lower part by mirroring and changing the sign. Note that efficient calculation of the top part can be done starting from R(n/2) = r(n/2+1) and going up reducing the index (R(n/2 -1), R(n/2 -2), R(n/2 -3)...) each time using R(i) = R(i+1) - r(i+1) - r(n-i+1) that can be well rewritten as R(i-1) = R(i) - r(i) - r(n-i+2).

As a matter of calculation involved this about halves the calculation done but as a matter of actual speed it needs to be tested in order to see if implementation with explicit operations are as quick as the internal implementations of built-in functions as TOTAL(/CUMULATIVE) and similar. There are good probabilities that it is quicker since we can here avoid also TRANSPOSE and/or REVERSE.

Let us know how it goes with a bit of profiling!

Diego, Thank you for that, but I am looking for the fastest code, and I don't see how your approach is faster than what I already have. Reverse(total(reverse())) is going to be pretty expensive in time. also total(,2) is about three times more more expensive in time than total(,1), so transpose(total(transpose(),1)) may be faster. Hmm, if I do the reverse with a lookup array I might save some cycles. Thank you. — Brian Carcich, Jul 23 '14 at 21:26
Yesss... the minus... .I correct it! ..also you here have one (1M) multiplication removed d * ( - ) instead than d * - d * . I totally gree to work with lookups.. ..or maybe can be better to transpose the Img and use TOTAL(<*>,1). However I'm thinking more.. ..now I try to finish to think and improve the answer! — Diego Mazzaro, Jul 23 '14 at 21:42
I am marking this as the correct answer because this contains the approach and notation that took the scales off my eyes led me to the final code, which I put in the problem statement and is also now in place in the data processing pipeline for New Horizons data. Pluto, here we come, thanks to Diego! — Brian Carcich, Sep 17 '14 at 22:56