I am looking for most efficient IDL code to replace the IDL matrix multiply (#) operator for a specific, diagonally-oriented (not diagonal, or diagonally-symmetric) matrix with 3 distinct values: unity on the diagonal; unity plus a delta to the right of the diagonal; unity minus the same delta to the left.
Problem domain
IDL (fixed; non-negotiable; sorry); image smear on shutter-less CCD imaging system.
Basic problem statement
Given
a 1024x1024 matrix, "EMatrix," with unity on the diagonal; (1-delta) to the left of the diagonal; (1+delta) to the right; delta = 0.044.
another 1024x1024 matrix, Image
Query
- what is the fastest IDL code to calculate (Image # EMatrix)?
2014-09-16: See update below
Background
Larger problem statement (of which the matrix multiply seems to be only the slowest part, and optimizing the whole routine would not hurt):
Section 9.3.1.2 (PDF page 47; internal page 34), and other documents in that same directory (sorry as a newbie I can only post two links)
My work so far
That is now (2014-09-26) about an order of magnitude faster than the IDL # operator for a 1024x1024 matrix.
Details
The naive operation is O(n^3) and performs about a billion (2^30) double-precision multiplies and about the same number of additions; Wikipedia further tells me Strassen's algorithm cuts that down to O(n^2.807), or ~282M+ multiplies for n=1024.
Breaking it down for a simple 3x3 case, say image and EMatrix are
image EMatrix
[ 0 1 2 ] [ 1 p p ]
[ 3 4 5 ] # [ m 1 p ]
[ 6 7 8 ] [ m m 1 ]
where p represents 1+delta (1.044) and m represents 1-delta (0.956).
Because of the repetition of m's and p's, there should be a simplification available: looking at the middle column for image, the result for the three rows should be
[1,4,7] . [1,p,p] = m*(0) + 1*1 + p*(4+7)
[1,4,7] . [m,1,p] = m*(1) + 1*4 + p*(7)
[1,4,7] . [m,m,1] = m*(1+4) + 1*7 + p*(0)
where . represents the dot (inner?) product i.e. [1,4,7].[a,b,c] = (1a + 4b + 7c)
Based on that, here's what I did so far:
The middle term is just the column itself, and the sums multiplied by m and p look a lot like cumulative sums of contiguous sections of the column, possibly reversed (for m), shifted one and with the first element set to zero.
i.e. for m:
; shift image down one:
imgminusShift01 = shift(image,0,1)
; zero the top row:
imgminusShift01[*,0] = 0
; Make a cumulative sum:
imgminusCum = total( imageShift01, 2, /cumulative)
For p, imgplusShift01Cum follows essentially the same path but with a rotate(...,7) before and after to flip things up and down.
Once we have those three matrices (the original image, with its offspring imgminusShift01Cum and imgplusShift01Cum), the desired result is
m * imgminusShift01Cum + 1 * image + p * imgplusShift01Cum
To do the shifts and rotates, I use indices, shifted and rotated themselves and stored in a common block for subsequent calls which saves another 10-20%.
Summary, so far
2014-09-16: See update below
And that gives a speedup of 5+.
I was expecting a bit more, because I think I am down to 3M multiplies and 2M additions, so maybe the memory allocation is the expensive part and I should be doing something like REPLICATE_INPLACE (my IDL is rusty - I have not done much since 6.4 and early 7.0).
Or maybe IDL's matrix multiplication is better than theory?
Alternate approaches and other thoughts:
Can we do something with the fact that the EMatrix is equal to unity plus a matrix with zeros on the diagonal and +/- delta in theupper and lower triangles?
By summing over columns I am accessing the data sequentially the "wrong" way, but would it actually save time to transpose Image first (it's only 8MB)?
Obviously choosing another language, getting a GPU array to help, or writing a DLM, would be other options, but let's keep this to IDL for now.
advTHANKSance (yes I am that old;-),
Brian Carcich 2014-07-20
Update 2014-09-16
Diego's approach brought me to a much simpler solution:
I think we got it; my first pass was too complicated, with all the rotations.
To use Diego's notation, but transposed, I am looking for
[K] = [IMG] # [E]
Since that multiplies columns of [IMG] by rows of [E], there is no interaction between columns of [IMG], so for analysis we only need to look at one column of [IMG], the dot (inner) products of which, with the rows of [E], become one column of the result [K]. Expanding that idea to one column of a 3x3 matrix with elements x, y and z:
[Kx] [x] [1 1+d 1+d]
[Ky] = [y] # [1-d 1 1+d]
[Kz] = [z] [1-d 1-d 1 ]
Looking specifically at element Ky above (one element of [K], corresponding to y in [IMG], broken down to a formula using only scalars):
Ky = x * (1-d) + y * 1 + z * (1+d)
Generalizing for any y in a column of any length:
Ky = sumx * (1-d) + y * 1 + sumz * (1+d)
Where scalars sumx and sumz are sums of all values above and below, respectively, y in that column of [IMG]. N.B. sumx and sumz are values specific to element y.
Rearranging:
Ky = (sumx + y + sumz) - d * (sumx - sumz)
Ky = tot - d * (sumx - sumz)
where
tot = (sumx + y + sumz)
i.e. tot is the sum of all values in the column (e.g. in IDL: tot = total(IMG,2)).
So to this point I have basically duplicated Diego's work; the rest of this analysis converts that last equation for Ky into a form suitable for speedy evaluation in IDL.
Solving the tot equation for sumz:
sumz = tot - (y + sumx)
Substituting back into Ky:
Ky = tot - (sumx - (tot - (y + sumx)))
Ky = tot - ((2 * sumx) + y - tot)
Ky = tot + (tot - ((2 * sumx) + y)
Using sumxy to represent the sum of all values in the column from the top down to, and including, y (IDL: [SUMXY] = total([IMG],2,/CUMULATIVE))
sumxy = sumx + y
and
sumx = sumxy - y
Substituting back into Ky:
Ky = tot + (tot - ((2 * (sumxy - y)) + y)
Ky = tot + (tot + y - (2 * sumxy))
So if we can evaluate tot and sumxy for every element of [IMG], i.e. if we can evaluate the matrices [TOT] and [SUMXY], and we already have [IMG] as the matrix version of y, then it is a simple linear combination of those matrices.
In IDL, these are simply:
[SUMXY] = TOTAL([IMG],2,/CUMULATIVE)
[TOT] = [SUMXY][*,N-1] # REPLICATE(1D0,1,N)
I.e. [TOT] is the last row of [SUMXY], duplicated to form a matrix of N rows.
And the final code looks like this:
function lorri_ematrix_multiply,NxN,EMatrix
NROWS = (size(NxN,/DIM))[1]
SUMXY = TOTAL(NxN,2,/CUMULATIVE)
TOT = SUMXY[*,NROWS-1] # REPLICATE(1,NROWS,1d0)
RETURN, TOT + ((EMatrix[1,0] - 1d0) * (TOT + NxN - (2d0 * SUMXY)))
end
which on our system is just shy of an order of magnitude faster than [IMG] # [E].
N.B. delta = (EMatrix[1,0] - 1d0)
Woo hoo!