
The Armadillo matrix library's documentation states:

Armadillo employs a delayed evaluation approach to combine several operations into one and reduce (or eliminate) the need for temporaries. Where applicable, the order of operations is optimised. Delayed evaluation and optimisation are achieved through recursive templates and template meta-programming.

This means that you can write operations like

arma::mat A, B;
arma::vec c, d;
...
d = (A % B) * c;

and no temporary variables are created. (Note that % is the element-wise product operator in Armadillo.)
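Under the hood this works through expression templates; a minimal sketch of the general idea (my own simplification, not Armadillo's actual internals) looks like:

#include <cstddef>
#include <vector>

// A lightweight proxy representing the element-wise product of two
// operands without computing it.
template <class L, class R>
struct ElemProd {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] * r[i]; }
};

struct Vec {
    std::vector<double> data;
    double operator[](std::size_t i) const { return data[i]; }
    // Evaluation happens element by element on assignment, so no
    // intermediate vector is ever allocated.
    template <class L, class R>
    Vec& operator=(const ElemProd<L, R>& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// operator% builds the proxy instead of a result.
ElemProd<Vec, Vec> operator%(const Vec& a, const Vec& b) { return {a, b}; }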

I would like to be able to code in a similar style for an OpenCL application.

The libraries I've looked at are VexCL, ViennaCL, Boost.Compute, and clBLAS. VexCL and Boost.Compute don't even provide basic matrix functionality such as multiplication. clBLAS isn't a template library, so you have to invoke each operation manually. ViennaCL provides all the operations I need, but it doesn't seem to be able to chain them together.

For example,

    d = linalg::prod(linalg::element_prod(A, B), c);

fails to compile.
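A version that does compile (a sketch, assuming ViennaCL's documented element_prod and prod interfaces) has to materialize the element-wise product into an explicit temporary, which is exactly what I want to avoid:

#include <viennacl/matrix.hpp>
#include <viennacl/vector.hpp>
#include <viennacl/linalg/prod.hpp>

// Materialize the element-wise product, then run the matrix-vector
// product on the temporary -- two kernels and one intermediate matrix.
viennacl::matrix<double> tmp = viennacl::linalg::element_prod(A, B);
viennacl::vector<double> d = viennacl::linalg::prod(tmp, c);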

I think there might be some possibility of using VexCL to automatically generate kernels based on the operations Armadillo decides on, but I can't see a straightforward way to make that work.

Any suggestions?

Jeremy Salwen

2 Answers


You might want to check out ArrayFire.

ArrayFire is a matrix-based library with a JIT compilation engine that allows you to combine operations into a single kernel. This dramatically reduces the number of kernel launches for basic element-wise operations like the ones you posted above. For example, an expression in the style you posted can be written as:

#include <arrayfire.h>
using namespace af;

array A = randu(5, 5);       // 5x5 matrix
array B = randu(5, 5);       // 5x5 matrix
array c = constant(1, 1, 5); // 1x5 matrix of ones

// Note: % is the element-wise modulus in ArrayFire (not the element-wise
// product as in Armadillo); tile() replicates c along rows to 5x5.
array d = (A % B) + tile(c, 5);

In this example the modulus and the addition will be performed in one OpenCL kernel, and no temporaries are created. We also have backends for single-threaded CPU, CUDA, and OpenCL.
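If you also need the actual matrix-vector product from your original expression, a sketch along these lines should work (matmul() runs as a standalone kernel rather than through the JIT, as discussed in the comments below; c2 here is a hypothetical 5x1 column vector):

// Armadillo's (A % B) * c is an element-wise product followed by a
// matrix-vector product. In ArrayFire the element-wise product is
// operator*; matmul() performs the matrix product as a separate kernel.
array c2 = randu(5, 1);        // hypothetical 5x1 column vector
array d2 = matmul(A * B, c2);  // A * B is JIT-fused; matmul() is not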

Disclosure: I am one of the developers of the ArrayFire library.

Umar Arshad
  • Does ArrayFire divide computation across all GPUs and CPUs, or does it need explicit device management? – huseyin tugrul buyukisik Mar 08 '15 at 10:04
  • Does ArrayFire support JIT-ing matrix multiplication in combination with other operations? There was a typo in my original question; it was supposed to be matrix multiplication. – Jeremy Salwen Mar 08 '15 at 11:25
  • ArrayFire does not JIT the matrix multiplication. Operations like matrix multiplication will typically be more performant as standalone kernels. – Umar Arshad Mar 08 '15 at 19:19
  • ArrayFire will not perform load balancing, but you can easily run your operation on multiple devices using the af::setDevice function. Aside from simple element-wise operations, it is nearly impossible to get good results from automatic load balancing. It is best to let the user decide how to manage the data. – Umar Arshad Mar 08 '15 at 20:49

The operation from your example may be written in VexCL with the recently added tensordot() operation:

#include <vexcl/vexcl.hpp>

vex::Context ctx(vex::Filter::Env); // OpenCL context; see the single-device caveat below
const size_t n = 1024, m = 1024;    // example sizes (assumed for illustration)

vex::vector<double> A(ctx, n * m), B(ctx, n * m); // n x m matrices stored as flat vectors
vex::vector<double> c(ctx, m), d(ctx, n);

vex::slicer<1> s1(vex::extents[m]);    // shape of the input vector
vex::slicer<2> s2(vex::extents[n][m]); // shape of the matrices

using vex::_;
// Contract axis 1 of the matrix expression with axis 0 of the vector:
d = vex::tensordot(s2[_](A * B), s1[_](c), vex::axes_pairs(1, 0));

This results in a very straightforward implementation of the matrix-vector product (so it is probably not as efficient as a vendor-tuned BLAS kernel), but no temporaries are involved and only a single kernel is launched.

One caveat: tensordot() may only be used with single-device contexts.
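If your context might span several devices, one way to restrict it (a sketch using VexCL's standard device filters) is:

// Keep only one device so that tensordot() is applicable.
vex::Context ctx(vex::Filter::Env && vex::Filter::Count(1));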

ddemidov