
I have a function in MATLAB which performs Gram-Schmidt Orthogonalisation with a very important weighting applied to the inner-products (I don't think MATLAB's built-in function supports this). This function works well as far as I can tell; however, it is too slow on large matrices. What would be the best way to improve this?

I have tried converting to a MEX file but I lose parallelisation with the compiler I'm using and so it is then slower.

I was thinking of running it on a GPU, as the element-wise multiplications are highly parallelisable (but I'd prefer the implementation to be easily portable).

Can anyone vectorise this code or make it faster? I am not sure how to do it elegantly ...

I know the stackoverflow minds here are amazing, consider this a challenge :)

Function

function [Q, R] = Gram_Schmidt(A, w)
    [m, n] = size(A);
    Q = complex(zeros(m, n));
    R = complex(zeros(n, n));
    v = zeros(m, 1);

    for j = 1:n
        v = A(:,j);
        for i = 1:j-1
            R(i,j) = sum(   v    .* conj( Q(:,i) ) .* w ) / ...
                     sum( Q(:,i) .* conj( Q(:,i) ) .* w );
            v = v - R(i,j) * Q(:,i);
        end
        R(j,j) = norm(v);
        Q(:,j) = v / R(j,j);
    end
end

where A is an m x n matrix of complex numbers and w is an m x 1 vector of real numbers.

Bottleneck

This is the expression for R(i,j), which is the slowest part of the function (not 100% sure if the notation is correct):

R(i,j) = <v, Q_i>_w / <Q_i, Q_i>_w,   where   <x, y>_w = sum_k x(k) * conj(y(k)) * w(k)

where w is a non-negative weight function. The weighted inner-product is mentioned on several Wikipedia pages, e.g. the ones on weight functions and on orthogonal functions.
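For reference, the same inner product written as a small MATLAB sketch (the helper name winner is mine, purely for illustration):

% weighted inner product <x, y>_w  (w real and non-negative)
winner = @(x, y, w) sum( x .* conj(y) .* w );

% R(i,j) in the function above is then
%   winner(v, Q(:,i), w) / winner(Q(:,i), Q(:,i), w)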

Reproduction

You can produce results using the following script:

A = complex( rand(360000,100), rand(360000,100));
w = rand(360000, 1);
[Q, R] = Gram_Schmidt(A, w);

where A and w are the inputs.

Speed and Computation

If you use the above script you will get profiler results similar to the following. (The original post included profiler screenshots here, showing the overall times and the per-line code times; images omitted.)

Testing Result

You can test the results by comparing a function with the one above using the following script:

A = complex( rand( 100, 10), rand( 100, 10));
w = rand( 100, 1);
[Q , R ] = Gram_Schmidt( A, w);
[Q2, R2] = Gram_Schmidt2( A, w);
zeros1 = norm( Q - Q2 );
zeros2 = norm( R - R2 );

where Gram_Schmidt is the function described earlier and Gram_Schmidt2 is an alternative function. The results zeros1 and zeros2 should then be very close to zero.
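As an additional sanity check (my suggestion, not part of the original post), you can also verify the factorisation itself, since A = Q*R holds by construction regardless of the inner product used for the coefficients:

relres = norm( A - Q2 * R2 ) / norm( A );   % should be near machine epsilon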

Note:

I tried speeding up the calculation of R(i,j) with the following but to no avail ...

R(i,j) = ( w' * (   v    .* conj( Q(:,i) ) ) ) / ...
         ( w' * ( Q(:,i) .* conj( Q(:,i) ) ) );
JacobD
    I bet I'm spoiling all the fun, but [MIT has a nice PDF about it](http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/related-resources/MIT18_06S10_gramschmidtmat.pdf). – rubenvb Nov 12 '14 at 13:18
  • @rubenvb I have seen this PDF from MIT before. It is very good except that it doesn't account for any weighting of the inner-product in the Gram-Schmidt. I have not been able to find anyone who applies a weighting to the Gram-Schmidt method so far. – JacobD Nov 12 '14 at 13:21
  • @rubenvb Actually, I might add, the orthogonalisation is much faster without the extra weighting matrix. However, when the extra matrix is introduced (as far as I can see) it means I need to do element-wise multiplications and a summation (where a simple multiplication and transpose worked before). – JacobD Nov 12 '14 at 13:27
  • Here's [another link](http://www8.cs.umu.se/~marten/forskning/wgs_report.ps) in which it is claimed that both the weighted and original algorithm have the same cost. This is of course no guarantee that the code will run equally fast, as there are obviously more multiplications involved. – rubenvb Nov 12 '14 at 13:31
  • @JacobD If `R(i,j)` is the slow part, you can precompute `T = sum(bsxfun(@times, abs(Q).^2, w),1);` or `T = abs(Q).'.^2*w;` and then use `T(i)` instead of the denominator (`w' * ( Q(:,i) .* conj( Q(:,i) ) )`). That way you avoid computing the same denominator `n` times (one for each `j`) – Luis Mendo Nov 12 '14 at 13:50
  • @rubenvb I have seen this before but for some reason had not read the whole paper. I may see if an implementation from that paper will help over the next few days. – JacobD Nov 12 '14 at 13:51
  • @LuisMendo I have tried pre-computing the denominator outside of the `for i = 1:j-1` loop before but it makes the calculation slower. Most of the time is then spent on the pre-computation part. – JacobD Nov 12 '14 at 14:06
  • @JacobD But it should be done just once, outside _both_ loops! – Luis Mendo Nov 12 '14 at 16:08
  • @LuisMendo Q is initially zeros, though? Q is also the orthogonal set, can that be simply pre-computed? – JacobD Nov 13 '14 at 00:03
  • @JacobD I don't think `Q` can be precomputed – Luis Mendo Nov 13 '14 at 07:24
  • @LuisMendo How can I precompute `T` if `Q` is zeros to begin with? If I do what you say T will just be zeros? i.e. `T = abs(zeros(m, n)).'.^2*w = zeros(m, n)` – JacobD Nov 13 '14 at 11:42
  • @JacobD You're right, it can't be precomputed. Sorry for the confusion – Luis Mendo Nov 13 '14 at 13:00
  • @JacobD is `w` a scalar? can you write a clear expression for your weighted inner-product? – Shai Nov 24 '14 at 13:11
  • Please provide sample inputs and profiler results to make for a properly reproducible example. – Dennis Jaheruddin Nov 24 '14 at 15:21
  • @Shai `w` is an `m x 1` vector. I will edit with a written expression today. @DennisJaheruddin I will edit to clarify when I get the chance today. – JacobD Nov 24 '14 at 20:55
  • You could write your code in R, and use vectorization, which should speed up the code significantly. –  Nov 25 '14 at 00:56
  • @John I ask if anyone can vectorise my code in the question because it does not appear to be easy (or possible?). I can't see that vectorising this code in R would be substantially different to vectorising it in MATLAB...? – JacobD Nov 25 '14 at 01:02
  • R allows tapply, which is ragged vectorization, for your inner loop. I don't think matlab has an equivalent. –  Nov 25 '14 at 01:09
  • Some minor comments: in the inner loop, operations that are repeated should be avoided. Assign qi=Q(:,i), cqw = conj( Q(:,i) ) .* w –  Nov 25 '14 at 01:14
  • In your sample inputs, can you remind me/us how to check if the output is correct? Add an expression to check the values of Q/R. –  Nov 25 '14 at 01:16
  • @John I will read up on tapply. I have tried preventing repeated operations, however, if I do what you say the function becomes slower. I think MATLAB knows to optimise this somehow? – JacobD Nov 25 '14 at 01:17
  • I did my MS on this problem, about 30 yrs ago. If you need speed for large problems, I would suggest writing in a fast base language such as C or C++. Matlab is in Java, R is in C (generally). IMO, Java is not designed for speed. –  Nov 25 '14 at 01:22
  • The code as you have it written would run very slow in R because R vectorizes everything, so loops are to be avoided. Instead of loops, use a variation of the apply function (lapply, tapply) in this case. –  Nov 25 '14 at 01:24
  • This code vectorizes on loop `j`. However, performance worsens (40 vs 110 seconds). Replace the loop `j` with: `for j = 1:n j v = A(:,j); QPart=conj(Q(:,1:j-1)); QPartConj=Q(:,1:j-1); if ~isempty(QPart) numR=sum(bsxfun(@times,bsxfun(@times,QPart,v),w),1); denoR=sum(bsxfun(@times,bsxfun(@times,QPart,QPartConj),w),1); R(1:j-1,j)=(numR./denoR).'; v=v-(sum(bsxfun(@times,(R(1:j-1,j)).',QPartConj),2)); else R(:,j)=0; end R(j,j) = norm(v); Q(:,j) = v / R(j,j); end` – Autonomous Nov 25 '14 at 01:27
  • I should also add that this runs faster than your version till j=20 (or maybe just more), then it starts becoming slower. So if you can apply the vectorized code on chunks, you may observe some improvement. Also, have you tried [`MKL`](http://en.wikipedia.org/wiki/Math_Kernel_Library)? C combined with MKL may boost your speed by a significant margin. – Autonomous Nov 25 '14 at 01:30
  • @ParagS.Chandakkar I haven't tried MKL before, I may look into it. How much faster does the vectorised code run before it becomes slower and can you formulate the point at which becomes slower? – JacobD Nov 25 '14 at 05:45
  • It's best for you to try. Just replace your `j` loop with the one given in the comment. I think the code becomes slower at `j=20` – Autonomous Nov 25 '14 at 08:40
  • Since you are willing to try out GPUs, have you thought of implementing it with CUDA? – Divakar Nov 25 '14 at 08:51
  • @Divakar I would preferably like to see how much better I can get without the need for GPU processing, however, I understand it may be one of the best options. I was about to implement it and then thought I'd ask this question on stackoverflow instead first. – JacobD Nov 25 '14 at 10:01
  • @ParagS.Chandakkar I can get your vectorisation to work but it never runs faster than mine? Not at any iteration. – JacobD Nov 25 '14 at 10:24
  • @John Matlab's IDE is in Java. The engine is in C and originally was in Fortran. – Oleg Nov 26 '14 at 00:02
  • @OlegKomarov Thanks, I was aware of the Fortran origin, but thought the engine was in Java. –  Nov 26 '14 at 00:06

3 Answers


1)

My first attempt at vectorization:

function [Q, R] = Gram_Schmidt1(A, w)
    [m, n] = size(A);
    Q = complex(zeros(m, n));
    R = complex(zeros(n, n));

    for j = 1:n
        v = A(:,j);
        QQ = Q(:,1:j-1);
        QQ = bsxfun(@rdivide, bsxfun(@times, w, conj(QQ)), w.' * abs(QQ).^2);
        for i = 1:j-1
            R(i,j) = (v.' * QQ(:,i));
            v = v - R(i,j) * Q(:,i);
        end
        R(j,j) = norm(v);
        Q(:,j) = v / R(j,j);
    end
end

Unfortunately, it turned out to be slower than the original function.


2)

Then I realized that the columns of this intermediate matrix QQ are built incrementally, and that previous ones are not modified. So here is my second attempt:

function [Q, R] = Gram_Schmidt2(A, w)
    [m, n] = size(A);
    Q = complex(zeros(m, n));
    R = complex(zeros(n, n));
    QQ = complex(zeros(m, n-1));

    for j = 1:n
        if j>1
            qj = Q(:,j-1);
            QQ(:,j-1) = (conj(qj) .* w) ./ (w.' * (qj.*conj(qj)));
        end
        v = A(:,j);
        for i = 1:j-1
            R(i,j) = (v.' * QQ(:,i));
            v = v - R(i,j) * Q(:,i);
        end
        R(j,j) = norm(v);
        Q(:,j) = v / R(j,j);
    end
end

Technically no major vectorization was done; I've only precomputed intermediate results, and moved the computation outside the inner loop.

Based on a quick benchmark, this new version is definitely faster:

% some random data
>> M = 10000; N = 100;
>> A = complex(rand(M,N), rand(M,N));
>> w = rand(M,1);

% time
>> timeit(@() Gram_Schmidt(A,w), 2)     % original version
ans =
    1.2444
>> timeit(@() Gram_Schmidt1(A,w), 2)    % first attempt (vectorized)
ans =
    2.0990
>> timeit(@() Gram_Schmidt2(A,w), 2)    % final version
ans =
    0.4698

% check results
>> [Q,R] = Gram_Schmidt(A,w);
>> [Q2,R2] = Gram_Schmidt2(A,w);
>> norm(Q-Q2)
ans =
   4.2796e-14
>> norm(R-R2)
ans =
   1.7782e-12

EDIT:

Following the comments, we can rewrite the second solution to get rid of the if-statement, by moving that part to the end of the outer loop (i.e. immediately after computing the new column Q(:,j), we compute and store the corresponding QQ(:,j)).

The function is identical in output, and timing is not that different either; the code is just a bit shorter!

function [Q, R] = Gram_Schmidt3(A, w)
    [m, n] = size(A);
    Q = zeros(m, n, 'like',A);
    R = zeros(n, n, 'like',A);
    QQ = zeros(m, n, 'like',A);

    for j = 1:n
        v = A(:,j);
        for i = 1:j-1
            R(i,j) = (v.' * QQ(:,i));
            v = v - R(i,j) * Q(:,i);
        end
        R(j,j) = norm(v);
        Q(:,j) = v / R(j,j);
        QQ(:,j) = (conj(Q(:,j)) .* w) ./ (w.' * (Q(:,j).*conj(Q(:,j))));
    end
end

Note that I used the zeros(..., 'like',A) syntax (new in recent MATLAB versions). This allows us to run the function unmodified on the GPU (assuming you have the Parallel Computing Toolbox):

% CPU
[Q3,R3] = Gram_Schmidt3(A, w);

vs.

% GPU
AA = gpuArray(A);
[Q3,R3] = Gram_Schmidt3(AA, w);

Unfortunately in my case, it wasn't any faster. In fact it was many times slower to run on the GPU than on the CPU, but it was worth a shot :)
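For reference, this is roughly how the CPU vs. GPU timing can be done with timeit/gputimeit (a sketch; see the comments below — exact numbers depend heavily on the hardware):

% CPU timing
tCPU = timeit(@() Gram_Schmidt3(A, w));

% GPU timing (Parallel Computing Toolbox)
AA = gpuArray(A);
ww = gpuArray(w);
tGPU = gputimeit(@() Gram_Schmidt3(AA, ww));

fprintf('CPU: %.3f s, GPU: %.3f s\n', tCPU, tGPU);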

Amro
  • Nice! It is indeed faster. Even more so when `n` is larger. In my unfortunate situation `m` is considerably larger than `n`, nevertheless this is a decent improvement. – JacobD Nov 25 '14 at 12:27
  • It just sped up one of my ~35min computations to ~22mins. – JacobD Nov 25 '14 at 12:37
  • You could move the j=1 iteration out of the loop, placing it before the loop, then change the loop to run from j=2 to n and remove the if statement. –  Nov 26 '14 at 00:20
  • @John: Actually I already tried that, the performance was pretty much identical (probably thanks to [branch prediction](https://en.wikipedia.org/wiki/Branch_predictor) at the CPU level, seeing that the if-statement is almost always true) – Amro Nov 26 '14 at 00:28
  • @Amro Assuming I remember my complex multiplication, and matlab... You could try the following: 1) calculate abs(qj).^2; 2) change qj.*conj(qj) to abs(qj).^2; 3) change conj(qj) .* w to abs(qj).^2 .* w ./ qj. –  Nov 26 '14 at 00:50
  • @Amro Good ol' branch prediction. – JacobD Nov 26 '14 at 01:38
  • @Amro I actually find it to be ever so slightly faster if `qj` isn't used and `Q(:,j-1)` is put straight into the line for `QQ`. @John This means that when I do what you say there is no performance improvement. – JacobD Nov 26 '14 at 06:11
  • you guys are probably right; there are still a couple of minor tweaks we can do here, but I think the performance gains are not as important, and IMO it comes at the expense of code readability... – Amro Nov 26 '14 at 12:52
  • In the original question, the numerator and denominator of R depend on the i loop. In your fastest solution, the denominator is independent of i. Does that produce the same result? It seems the increase in speed may come from making the denominator a constant, which doesn't seem right. –  Nov 30 '14 at 09:18
  • @John: that's precisely the "trick" in my answer; note that the inner loop goes from `i=1:j-1` and uses `Q(:,i)`, then the outer loop modifies `Q(:,j)` for the current `j`. This means that inside the inner loop, `Q(:,1:j-1)` is already computed from previous iterations of the outer loop, so we can take that expression outside of the inner loop. We still have the vector `v` involved which changes in the inner loop, so I just took the right-hand side of the numerator and pre-divided it by the denominator into `QQ`, and instead of `sum(x.*y)` I used the inner product `x'*y`. Makes sense? – Amro Nov 30 '14 at 12:59
  • @John: in fact, we can rewrite the second solution to get rid of the if-statement, by computing and adding the next `QQ(:,j)` at the end of the outer loop: http://pastebin.com/KwVjPgLD . The result is identical to before. – Amro Nov 30 '14 at 13:24
  • @Amro Could you please elaborate on how you tested your solution on the GPU. What size matrices did you use? What are the specifications of the GPU you used? What times did you get? How did you time it? (i.e. did you time the whole process including the time to copy memory to the GPU or just the processing time on the GPU?) – JacobD Dec 09 '14 at 05:59
  • @JacobD: sorry for the late reply. I had done a quick test with a medium-sized matrix. I compared only the computation time between the CPU version and the GPU version, without including data transfer times between host and device, I used proper `timeit`/`gputimeit` functions. Testing was done on my 3rd gen quad-core i7 laptop with a low-end Nvidia GT630M card running Windows 8.1, nothing fancy :) – Amro Dec 12 '14 at 02:52
  • @Amro How much slower was it? (You only stated 'many times'.) Perhaps it would give me an idea as to whether or not to pursue using a GPU with more cores, a faster clock speed and/or better memory bandwidth. (I know I state in my question that I'd prefer portability but I can always allow the switch to GPU computation) – JacobD Dec 12 '14 at 02:59
  • @JacobD: I think it was more than x10 times slower, which is why I gave up on it quickly, but maybe it will perform better on a more powerful machine (it's not hard to try!)... Now you mentioned in your question that you tried writing a MEX-file, how did that go? If you're still looking to improve performance, I think you should reconsider the C/C++ route; especially by using the optimized BLAS library that [ships](http://www.mathworks.com/help/matlab/matlab_external/calling-lapack-and-blas-functions-from-mex-files.html) with MATLAB (Intel MKL) to perform the vectorized multiplications. – Amro Dec 12 '14 at 03:44
  • ... Check out these posts for some examples I posted in the past: http://stackoverflow.com/a/6440839/97160, http://stackoverflow.com/a/26467886/97160 – Amro Dec 12 '14 at 03:44

There is a long discussion here, but, to jump to the answer: you have weighted the numerator and denominator of the R calculation by a vector w. The weighting occurs in the inner loop, and consists of a triple dot product, A dot Q dot w in the numerator and Q dot Q dot w in the denominator. If you make one change, I think the code will run significantly faster: write num = (A dot sqrt(w)) dot (Q dot sqrt(w)) and den = (Q dot sqrt(w)) dot (Q dot sqrt(w)). That moves the (A dot sqrt(w)) and (Q dot sqrt(w)) product calculations out of the inner loop.
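To see that this regrouping computes the same R(i,j), here is a small self-contained check (variable names are mine, standing in for one inner-loop iteration of the question's function):

% the sqrt(w) regrouping reproduces the weighted projection coefficient
m = 1000;
v = complex(rand(m,1), rand(m,1));    % stands in for A(:,j)
q = complex(rand(m,1), rand(m,1));    % stands in for Q(:,i)
w = rand(m,1);

sw = sqrt(w);
vs = v .* sw;    % computable once per column j
qs = q .* sw;    % computable once per basis vector i

r1 = sum( v .* conj(q) .* w ) / sum( q .* conj(q) .* w );   % original form
r2 = (qs' * vs) / (qs' * qs);                               % regrouped form
assert( abs(r1 - r2) < 1e-10 )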

I would like to give a description of the formulation of Gram-Schmidt orthogonalization that, hopefully, in addition to giving an alternative computational solution, gives further insight into the advantage of GSO.

The "goals" of GSO are two fold. First, to enable the solution of an equation like Ax=y, where A has far more rows than columns. This situation occurs frequently when measuring data, in that it is easy to measure more data than the number of states. The approach to the first goal is to rewrite A as QR such that the columns of Q are orthogonal and normalized, and R is a triangular matrix. The algorithm you provided, I believe, achieves the first goal. The Q represents the basis space of the A matrix, and R represents the amplitude of each basis space required to generate each column of A.

The second goal of GSO is to rank the basis vectors in order of significance. This is the step that you have not done. And while including this step may increase the solution time, the results will identify which elements of x are important, according to the data contained in the measurements represented by A.

But, I think, with this implementation, the solution is faster than the approach you presented.

A = Q R, where the columns Q_j of Q are orthonormal and R is upper triangular (R_ij = 0 for i > j). The Q_j are the orthogonal basis vectors for A, and R_ij is the participation of each Q_j in creating a column of A. So,

A_1 = Q_1 * R_11
A_2 = Q_1 * R_12 + Q_2 * R_22
A_3 = Q_1 * R_13 + Q_2 * R_23 + Q_3 * R_33

By inspection, you can write

A_1 = ( A_1 / |A_1| ) * |A_1| = Q_1 * R_11

Then you project Q_1 onto every other column of A to get the R_1j elements:

R_12 = Q_1 dot A_2
R_13 = Q_1 dot A_3
...
R_1j (j>1) = Q_1 dot A_j

Then you subtract the projection onto Q_1 from the columns of A (this would set the first column to zero, so you can skip it):

for j = 2,n
  A_j = A_j - R_1j * Q_1
end

Now one column of A has been accounted for: the first orthonormal basis vector Q_1 has been determined, the contribution of the first basis vector to each column, R_1j, has been determined, and the contribution of the first basis vector has been subtracted from each column. Repeat this process for the remaining columns of A to obtain the remaining columns of Q and rows of R.

for i = 1,n
  R_ii = |A_i|              # A_i is the ith column of A, |A_i| is the magnitude of A_i
  Q_i = A_i / R_ii          # Q_i is the ith column of Q
  for j = i+1, n
    R_ij = A_j dot Q_i      # A dot B = A' * B
    A_j = A_j - R_ij * Q_i
  end
end

You are trying to weight the rows of A with w. Here is one approach: I would normalize w and incorporate the effect into R. You "removed" the effects of w by multiplying and dividing by w. An alternative to "removing" the effect is to normalize the amplitude of w to one.

w = w / |w|
for i = 1,n
  R_ii = |A_i .* w|                 # A_i inner product w = A_i .* w (elementwise)
  Q_i = A_i / R_ii
  for j = i+1, n
    R_ij = (A_j .* w) dot Q_i       # A dot B = A' * B
    A_j = A_j - R_ij * Q_i
  end
end

Another approach to implementing w is to normalize w and then premultiply every column of A by w. That cleanly weights the rows of A, and reduces the number of multiplications.
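A rough sketch of that premultiply idea (my own illustration, not code from this answer; note that R(j,j) then holds weighted norms, so the normalization convention differs from the question's function, and qr may flip signs/phases relative to Gram-Schmidt):

sw = sqrt( w / norm(w) );         % normalized weights, applied as sqrt
B  = bsxfun(@times, sw, A);       % weight the rows of A once, up front
[QB, R] = qr(B, 0);               % any unweighted orthogonalisation of B
Q  = bsxfun(@rdivide, QB, sw);    % unweight the basis; A == Q * R exactly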

Using the following identities may help in speeding up your code

A inner product B = A .* B          # elementwise product
A dot w = A' * w                    # ' is the conjugate transpose
(A B)' = B' A'
A .* conj(A) = |A|.^2               # elementwise squared magnitude
The above can be vectorized easily in matlab, pretty much as written.
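For example, here is a small self-contained check (my addition) that the question's bottleneck numerator collapses to a single dot product using these identities:

m = 1000;
v = complex(rand(m,1), rand(m,1));
q = complex(rand(m,1), rand(m,1));
w = rand(m,1);

lhs = sum( v .* conj(q) .* w );          % triple elementwise product + sum
rhs = transpose(v) * ( conj(q) .* w );   % one elementwise product + dot product
assert( abs(lhs - rhs) < 1e-10 )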

But you are missing the second portion, the ranking of A, which tells you which states (elements of x in A x = y) are significantly represented in the data.

The ranking procedure is easy to describe, but I'll let you work out the programming details. The above procedure essentially assumes the columns of A are in order of significance: the first column is subtracted off all the remaining columns, then the 2nd column is subtracted off the remaining columns, and so on.

The first row of R represents the contribution of the first column of Q to each column of A. If you sum the absolute values of the first row of R, you get a measure of the contribution of the first column of Q to the matrix A. So, you just evaluate each column of A as a candidate for the first (or next) column of Q, and determine the ranking score of the contribution of that Q column to the remaining columns of A. Then select the A column that has the highest rank as the next Q column. Coding this essentially comes down to pre-estimating the next row of R, for every remaining column of A, in order to determine which ranked R magnitude has the largest amplitude. An index vector that records the original column order of A will be beneficial.

By ranking the basis vectors, you end up with the "principal" basis vectors that represent A, which is typically much smaller in number than the number of columns of A. (A rough sketch of one possible pivoting scheme is given below.)

Also, if you rank the columns, it is not necessary to calculate every column of R. When you know which columns of A don't contain any useful information, there's no real benefit to keeping those columns.
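Here is the pivoting sketch mentioned above (my own illustration of the ranking idea, not code from the original answer; the scoring rule below simply picks the remaining residual column with the largest weighted norm, which is one common choice and not necessarily the exact criterion described):

function [Q, R, p] = Gram_Schmidt_pivoted(A, w)
    % Modified Gram-Schmidt with column pivoting (illustrative sketch).
    % p records the chosen column order; A(:,p) == Q * R up to round-off.
    [m, n] = size(A);
    Q = complex(zeros(m, n));
    R = complex(zeros(n, n));
    V = A;                               % residual columns
    p = 1:n;
    for i = 1:n
        % rank the remaining residual columns by weighted norm; pick the best
        [~, k] = max( w.' * abs(V(:, i:n)).^2 );
        k = k + i - 1;
        V(:, [i k]) = V(:, [k i]);
        R(1:i-1, [i k]) = R(1:i-1, [k i]);   % keep R consistent with the swap
        p([i k]) = p([k i]);
        R(i,i) = norm(V(:,i));
        Q(:,i) = V(:,i) / R(i,i);
        cqw = conj(Q(:,i)) .* w;         % shared factor, as in Gram_Schmidt_4
        den = transpose(Q(:,i)) * cqw;   % weighted squared norm of Q(:,i)
        for j = i+1:n                    % project Q(:,i) out of the rest
            R(i,j) = (transpose(V(:,j)) * cqw) / den;
            V(:,j) = V(:,j) - R(i,j) * Q(:,i);
        end
    end
end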

In structural dynamics, one approach to reducing the number of degrees of freedom is to calculate the eigenvalues, assuming you have representative values for the mass and stiffness matrices. If you think about it, the above approach can be used to "calculate" the M and K (and C) matrices from measured response, and also to identify the "measurement response shapes" that are significantly represented in the data. These are different from, and potentially more important than, the mode shapes. So you can solve very difficult problems, i.e., estimation of state matrices and the number of degrees of freedom represented, from measured response, by the above approach. If you read up on N4SID, its authors did something similar, except they used SVD instead of GSO. I don't like the technical description of N4SID; there is too much focus on vector projection notation, which is simply a dot product.

There may be one or two errors in the above information; I'm writing this off the top of my head, before rushing off to work. So, check the algorithm/equations as you implement... Good luck.

Coming back to your question of how to optimize the algorithm when you weight with w: here is a basic GSO algorithm, without the sorting, written to be compatible with your function.

Note: the code below is in Octave, not MATLAB. There are some minor differences.

function [Q, R] = Gram_Schmidt_2(A, w)

    [m, n] = size(A);
    Q = complex(zeros(m, n));
    R = complex(zeros(n, n));

    # Outer loop identifies the basis vectors    
    for j = 1:n
      aCol = A(:,j);
      # Subtract off the basis vector
      for i = 1:(j-1)
        R(i,j) = ctranspose(Q(:,i)) * aCol;
        aCol = aCol - R(i,j) * Q(:,i);
      end
      amp_A_col = norm(aCol);
      R(j,j) = amp_A_col;
      Q(:,j) = aCol / amp_A_col;
    end
end

To get your algorithm, only the line for R(i,j) changes. But you lose a lot of speed, because ctranspose(Q(:,i)) * aCol is a single vector operation while sum( aCol .* conj( Q(:,i) ) .* w ) is an elementwise operation followed by a sum.

function [Q, R] = Gram_Schmidt_2(A, w)

    [m, n] = size(A);
    Q = complex(zeros(m, n));
    R = complex(zeros(n, n));

    # Outer loop identifies the basis vectors    
    for j = 1:n
      aCol = A(:,j);
      # Subtract off the basis vector
      for i = 1:(j-1)
        # R(i,j) = ctranspose(Q(:,i)) * aCol;
        R(i,j) = sum(   aCol  .* conj( Q(:,i) ) .* w ) / ...
                 sum( Q(:,i) .* conj( Q(:,i) ) .* w );
        aCol = aCol - R(i,j) * Q(:,i);
      end
      amp_A_col = norm(aCol);
      R(j,j) = amp_A_col;
      Q(:,j) = aCol / amp_A_col;
    end
end

You can change it back to a vector operation by weighting aCol and Q by the sqrt of w.

function [Q, R] = Gram_Schmidt_3(A, w)

    [m, n] = size(A);
    Q = complex(zeros(m, n));
    R = complex(zeros(n, n));
    Q_sw = complex(zeros(m, n));
    sw = w .^ 0.5;
    for j = 1:n
      aCol = A(:,j);
      aCol_sw = aCol .* sw;
      # Subtract off the basis vector
      for i = 1:(j-1)
        # R(i,j) = ctranspose(Q(:,i)) * aCol;
        numTerm = ctranspose( Q_sw(:,i) ) * aCol_sw;
        denTerm = ctranspose( Q_sw(:,i) ) * Q_sw(:,i);
        R(i,j) = numTerm / denTerm;
        aCol_sw = aCol_sw - R(i,j) * Q_sw(:,i);
      end
      aCol = aCol_sw ./ sw;
      amp_A_col = norm(aCol);
      R(j,j) = amp_A_col;
      Q(:,j) = aCol / amp_A_col;
      Q_sw(:,j) = Q(:,j) .* sw;
    end
end

As pointed out by JacobD, the above function does not run faster. Possibly it takes time to create the additional arrays. Another grouping strategy for the triple product is to group w with conj(Q). Hope this is faster...

function [Q, R] = Gram_Schmidt_4(A, w)

    [m, n] = size(A);
    Q = complex(zeros(m, n));
    R = complex(zeros(n, n));

    for j = 1:n
      aCol = A(:,j);
      for i = 1:(j-1)
        cqw = conj(Q(:,i)) .* w;
        R(i,j) = ( transpose(  aCol )  * cqw) ...
               / (transpose( Q(:,i) ) * cqw);
        aCol = aCol - R(i,j) * Q(:,i);
      end
      amp_A_col = norm(aCol);
      R(j,j) = amp_A_col;
      Q(:,j) = aCol / amp_A_col;
    end
end

Here's a driver function to time different versions.

function Gram_Schmidt_tester_2
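      # Note (added for clarity): the function names below map to the versions
      # discussed above:
      #   Gram_Schmidt_basic  - unweighted GSO (code not shown)
      #   Gram_Schmidt_w_Orig - the original weighted function from the question
      #   Gram_Schmidt_sqrt_w - the sqrt(w) split (Gram_Schmidt_3 above)
      #   Gram_Schmidt_4      - grouping conj(Q) with w (above)
      #   Gram_Schmidt_5      - row/column interchange variant (code not included)
      #   Gram_Schmidt2a      - version 2 of Amro's answer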
      nSamples = 360000;
      nMeas = 15;      # reduced from 100; run time grows rapidly with nMeas

      A = complex( rand(nSamples,nMeas), rand(nSamples,nMeas));
      w = rand(nSamples, 1);

      profile on;
      [Q1, R1] = Gram_Schmidt_basic(A);
      profile off;
      data1 = profile ("info");
      tData1=data1.FunctionTable(1).TotalTime;
      approx_zero1 = A - Q1 * R1;
      max_value1 = max(max(abs(approx_zero1)));

      profile on;
      [Q2, R2] = Gram_Schmidt_w_Orig(A, w);
      profile off;
      data2 = profile ("info");
      tData2=data2.FunctionTable(1).TotalTime;
      approx_zero2 = A - Q2 * R2;
      max_value2 = max(max(abs(approx_zero2)));

      sw=w.^0.5;
      profile on;
      [Q3, R3] = Gram_Schmidt_sqrt_w(A, w);
      profile off;
      data3 = profile ("info");
      tData3=data3.FunctionTable(1).TotalTime;
      approx_zero3 = A - Q3 * R3;
      max_value3 = max(max(abs(approx_zero3)));

      profile on;
      [Q4, R4] = Gram_Schmidt_4(A, w);
      profile off;
      data4 = profile ("info");
      tData4=data4.FunctionTable(1).TotalTime;
      approx_zero4 = A - Q4 * R4;
      max_value4 = max(max(abs(approx_zero4)));

      profile on;
      [Q5, R5] = Gram_Schmidt_5(A, w);
      profile off;
      data5 = profile ("info");
      tData5=data5.FunctionTable(1).TotalTime;
      approx_zero5 = A - Q5 * R5;
      max_value5 = max(max(abs(approx_zero5)));


      profile on;
      [Q2a, R2a] = Gram_Schmidt2a(A, w);
      profile off;
      data2a = profile ("info");
      tData2a=data2a.FunctionTable(1).TotalTime;
      approx_zero2a = A - Q2a * R2a;
      max_value2a = max(max(abs(approx_zero2a)));


      profshow (data1, 6);
      profshow (data2, 6);
      profshow (data3, 6);
      profshow (data4, 6);
      profshow (data5, 6);
      profshow (data2a, 6);

      sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
         data1.FunctionTable(1).FunctionName,
         data1.FunctionTable(1).TotalTime,     
         nSamples, nMeas, max_value1)
      sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
         data2.FunctionTable(1).FunctionName,
         data2.FunctionTable(1).TotalTime,     
         nSamples, nMeas, max_value2)
      sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
         data3.FunctionTable(1).FunctionName,
         data3.FunctionTable(1).TotalTime,     
         nSamples, nMeas, max_value3)
      sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
         data4.FunctionTable(1).FunctionName,
         data4.FunctionTable(1).TotalTime,     
         nSamples, nMeas, max_value4)
      sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
         data5.FunctionTable(1).FunctionName,
         data5.FunctionTable(1).TotalTime,     
         nSamples, nMeas, max_value5)
      sprintf('Time for %s is %5.3f sec with %d samples and %d meas, max value is %g',
         data2a.FunctionTable(1).FunctionName,
         data2a.FunctionTable(1).TotalTime,     
         nSamples, nMeas, max_value2a)

end

On my old home laptop, in Octave, the results are

ans = Time for Gram_Schmidt_basic is 0.889 sec with 360000 samples and 15 meas, max value is 1.57009e-16
ans = Time for Gram_Schmidt_w_Orig is 0.952 sec with 360000 samples and 15 meas, max value is 6.36717e-16
ans = Time for Gram_Schmidt_sqrt_w is 0.390 sec with 360000 samples and 15 meas, max value is 6.47366e-16
ans = Time for Gram_Schmidt_4 is 0.452 sec with 360000 samples and 15 meas, max value is 6.47366e-16
ans = Time for Gram_Schmidt_5 is 2.636 sec with 360000 samples and 15 meas, max value is 6.47366e-16
ans = Time for Gram_Schmidt2a is 0.905 sec with 360000 samples and 15 meas, max value is 6.68443e-16

These results indicate that the fastest algorithm is the sqrt_w algorithm above at 0.390 sec, followed by the grouping of conj(Q) with w (Gram_Schmidt_4, above) at 0.452 sec, then version 2 of Amro's solution at 0.905 sec, then the original algorithm in the question at 0.952 sec, and finally a version 5, which interchanges rows/columns to see if row storage helped (code not included), at 2.636 sec. These results indicate the sqrt(w) split between A and Q is the fastest solution. But they are not consistent with JacobD's comment about sqrt(w) not being faster.

  • In your `Gram_Schmidt_3` function I think the `j` index should be `i` for `Q_sw` in the inner-loop. – JacobD Nov 28 '14 at 00:34
  • I can't get it to run any faster, though. – JacobD Nov 28 '14 at 00:40
  • Thanks for the correction on i/j, and I should have checked the speed. Try Gram_Schmidt_4. –  Nov 28 '14 at 11:19
  • @John Would you be able to post times for `nMeas=100` or even larger, say `nMeas=300`? I think Amro's version 2 solution is quicker when "nMeas" is larger – JacobD Nov 30 '14 at 06:24
  • @John Actually when `nSamples` or `nMeas` is larger. Although, when I use the same values as you I still get Amro's version 2 with faster speeds but when I make `nSamples` and `nMeas` smaller yours becomes faster. Hmmm – JacobD Nov 30 '14 at 06:36
  • @JacobD, the run time increases significantly for increasing nMeas, so I can't increase nMeas beyond 20. I wonder, do these different approaches produce the same values for Q and R? –  Nov 30 '14 at 09:13
  • @John That's a shame, I need to compute for large values of `nMeas` (i.e. up to about 300 for my particular application). So far, to the eye, the values from all these methods appear to be identical; there may be minor differences but nothing major. – JacobD Nov 30 '14 at 09:56
  • @JacobD There is a lot of information in these posts about how to optimize the calculation of R for speed. You should be able to write the function in C, and use a mex function in Matlab. That should give you the fastest implementation of these algorithms. Also, if you have more than 30 columns (nMeas), you should develop an algorithm that ranks the columns before picking the next Q basis vector, to rank the Q basis vectors, and also to minimize the number of Q basis vectors used. Essentially calculate the reduction in the magnitude of each A column, and pick the Q that gives the –  Dec 01 '14 at 01:16
  • @JacobD (continued) gives the largest reduction of the remaining A columns. Also, stop once the A columns are below 5% of their original size. –  Dec 01 '14 at 01:17
  • @John I have coded a MEX file for this algorithm previously but couldn't get it to run any faster. Could you provide an example of the column ranking idea comparing its speed to the original? – JacobD Dec 01 '14 at 01:50

It is possible to vectorize this so only one loop is necessary. The important fundamental change from the original algorithm is that if you swap the inner and outer loops, you can vectorize the projection of the reference vector onto all remaining vectors. Working off @Amro's solution, I found that an inner loop is actually faster than the matrix subtraction. I do not understand why this would be. Timing this against @Amro's solution, it is about 45% faster.

function [Q, R] = Gram_Schmidt5(A, w)
Q = A;
n_dimensions = size(A, 2);
R = zeros(n_dimensions);
R(1, 1) = norm(Q(:, 1));
Q(:, 1) = Q(:, 1) ./ R(1, 1);
for i = 2 : n_dimensions
    Qw = (Q(:, i - 1) .* w)' * Q(:, (i - 1) : end);
    R(i - 1, i : end) = Qw(2:end) / Qw(1);
    %% Surprisingly this loop beats the matrix multiply
    for j = i : n_dimensions
        Q(:, j) = Q(:, j) - Q(:, i - 1) * R(i - 1, j);
    end
    %% This multiply is slower than above
    %    Q(:, i : end) = ...
    %     Q(:, i : end) - ...
    %     Q(:, i - 1) * R(i - 1, i : end);
    R(i, i) = norm(Q(:,i));
    Q(:, i) = Q(:, i) ./ R(i, i);
end
end
TallBrianL
  • Where is the upper triangular matrix, `R` ? – JacobD Nov 26 '14 at 22:34
  • @JacobD You can recover R with R = Q' * A; – TallBrianL Nov 27 '14 at 02:53
  • This doesn't appear to be working the same as my algorithm. One or both of the outputs are different. I don't think the weighting is working correctly. – JacobD Nov 27 '14 at 08:00
  • @JacobD I am getting same outputs, can you give me a sample dataset where the outputs don't match? – TallBrianL Nov 29 '14 at 18:01
  • tall.b.lo For some reason I can't tag your name in the comments. Use the following where `Gram_Schmidt2` is your function and `Gram_Schmidt` is the one I provided. `A=complex( rand(100,10), rand(100,10)); w=rand(100,1); [Q, R] = Orthogonal_Basis_Expansion.Gram_Schmidt(A, w); [Q2, R2] = Orthogonal_Basis_Expansion.Gram_Schmidt2(A, w); zeros1=norm(Q-Q2); zeros2=norm(R-R2);` Here, `zeros2` should be zeros but it is not close to it. – JacobD Nov 29 '14 at 21:22
  • @JacobD, this was bugging the heck out of me. I believe I now have it properly calculating both Q and R. From my measurements, for matrices of 200-300 vectors, it is 2.6 times faster. – TallBrianL Dec 03 '14 at 00:52
  • It is correct now, however, it is not faster. It is running more than 3 times slower than my original function. – JacobD Dec 03 '14 at 01:44
  • @JacobD, got slightly obsessed with this problem, played with it a bit more. Found a faster solution based on Amro's solution. – TallBrianL Dec 05 '14 at 00:03
  • If, for example, I use `A = complex( rand(40000,300), rand(40000,300))` and `w = rand(40000, 1)`, I still don't get a solution faster using your updated method? – JacobD Dec 08 '14 at 12:11