The Armadillo matrix library describes its approach as follows:
Armadillo employs a delayed evaluation approach to combine several operations into one and reduce (or eliminate) the need for temporaries. Where applicable, the order of operations is optimised. Delayed evaluation and optimisation are achieved through recursive templates and template meta-programming.
This means that you can write operations like
arma::mat A, B;
arma::vec c, d;
...
d=(A % B)*c;
and no temporary variables are created. (Note that % is the element-wise product operator in Armadillo.)
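To make the idea concrete, here is a minimal sketch of how such delayed evaluation can work for exactly this expression. It is not Armadillo's actual implementation: the types Mat, Vec, ElemProd and MatVec are hypothetical names made up for illustration, and the sketch assumes the destination vector does not alias any operand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal matrix type (row-major); not Armadillo's arma::mat.
struct Mat {
    std::size_t rows, cols;
    std::vector<double> data;
    Mat(std::size_t r, std::size_t c, double v = 0.0)
        : rows(r), cols(c), data(r * c, v) {}
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
};

// Hypothetical minimal vector type; not Armadillo's arma::vec.
struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    std::size_t size() const { return data.size(); }
    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }

    // Assigning from an expression node evaluates it one element at a time.
    // Assumption: *this does not alias any operand of the expression.
    template <typename Expr>
    Vec& operator=(const Expr& e) {
        data.resize(e.size());
        for (std::size_t i = 0; i < e.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Lazy node for A % B: stores references only, computes entries on demand.
struct ElemProd {
    const Mat& a;
    const Mat& b;
    std::size_t rows() const { return a.rows; }
    std::size_t cols() const { return a.cols; }
    double operator()(std::size_t i, std::size_t j) const { return a(i, j) * b(i, j); }
};

inline ElemProd operator%(const Mat& a, const Mat& b) {
    assert(a.rows == b.rows && a.cols == b.cols);
    return ElemProd{a, b};
}

// Lazy node for (A % B) * c: entry i is sum_j A(i,j) * B(i,j) * c[j].
struct MatVec {
    ElemProd m;
    const Vec& v;
    std::size_t size() const { return m.rows(); }
    double operator[](std::size_t i) const {
        double s = 0.0;
        for (std::size_t j = 0; j < m.cols(); ++j) s += m(i, j) * v[j];
        return s;
    }
};

inline MatVec operator*(const ElemProd& m, const Vec& v) {
    assert(m.cols() == v.size());
    return MatVec{m, v};
}
```

With these definitions, d = (A % B) * c; builds two small reference-holding nodes and then runs a single fused loop in Vec::operator=, so the element-wise product is never stored as a full intermediate matrix.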
I would like to be able to code in a similar style for an OpenCL application.
The libraries I've looked at are VexCL, ViennaCL, Boost.Compute, and clBLAS. VexCL and Boost.Compute don't even provide basic matrix functionality such as multiplication. clBLAS is not a template library, so each operation has to be invoked manually. ViennaCL provides all the operations I need, but it doesn't seem to be capable of chaining them together.
For example
d= linalg::prod(linalg::element_prod(A,B), c);
fails to compile.
It might be possible to use VexCL to generate kernels automatically from the operations Armadillo decides on, but I can't see a straightforward way to make that work.
Any suggestions?