I'm using the gradient descent implementation below in Octave for machine learning.
I first tried increasing the number of CPU cores and running Octave multithreaded with OpenBLAS, but I still didn't get the performance I was looking for, so I moved on to Nvidia's CUDA toolkit and a Tesla K80 GPU.
I'm launching Octave with the NVBLAS drop-in library, following the instructions in this article:
Drop-in Acceleration of GNU Octave
When I check nvidia-smi while my code runs, the GPU sits idle, even though a standalone matrix-matrix multiplication test yields ~9 teraflops.
Later I came to understand from the NVBLAS documentation that the matrix-vector multiplication (GEMV) my implementation relies on is not among the routines NVBLAS intercepts; it only offloads matrix-matrix (GEMM-class) routines.
So my question is: is there a gradient descent implementation that uses matrix-matrix multiplication, or something equivalent, that could replace the one I have?
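For context, one rearrangement I've been considering (not sure if this is the right approach): for linear-regression-style gradient descent, the gradient (1/m)·X'(Xθ − y) can be rewritten as Aθ − b with A = (1/m)·X'X and b = (1/m)·X'y. The expensive product X'X is then a single matrix-matrix multiplication that NVBLAS should be able to offload, and each iteration only touches the small n×n matrix A. A rough sketch of what I mean (function and variable names are mine):

```
% Sketch: precompute the Gram matrix with one GEMM, then iterate cheaply.
% X is m x n (bias column included), y is m x 1.
function theta = gd_gemm(X, y, alpha, num_iters)
  [m, n] = size(X);
  A = (X' * X) / m;   % matrix-matrix product: the part NVBLAS can offload
  b = (X' * y) / m;   % one-time matrix-vector product
  theta = zeros(n, 1);
  for i = 1:num_iters
    % gradient = A*theta - b is algebraically identical to (1/m)*X'*(X*theta - y)
    theta = theta - alpha * (A * theta - b);
  end
end
```

I'm not certain this is the idiomatic way to get NVBLAS to engage, so pointers to a better-known formulation would be welcome.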