I'm trying to optimize my code to take advantage of multicore processors when both copying and manipulating large dense arrays.
For copying: I have a large dense array (approximately 6000x100000) from which I need to pull 15x100000 subarrays for several computations further down the pipeline. The pipeline consists mostly of linear algebra functions handled by BLAS, which is already multithreaded. Whether the time spent pulling data will really matter compared to the linear algebra is an open question, but I'd like to err on the side of caution and make sure the data copying is optimized.
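To make the copy step concrete, here is a minimal sketch of the kind of thing I mean. It assumes the big array is stored row-major as doubles and that the 15 rows are picked by index; the name `gather_rows` and its parameters are made up for illustration. The pragma is standard OpenMP and is simply ignored if the compiler is not given an OpenMP flag.

```c
#include <stddef.h>
#include <string.h>

/* Copy the rows listed in row_idx from a row-major matrix src
 * (ncols doubles per row) into the contiguous buffer dst.
 * Assumption: src and dst do not overlap, and dst holds nsel * ncols doubles. */
void gather_rows(const double *src, size_t ncols,
                 const size_t *row_idx, size_t nsel, double *dst)
{
    /* One memcpy per selected row; the rows are independent, so they can
     * be spread across threads. Signed loop index for older OpenMP versions. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)nsel; ++i) {
        memcpy(dst + (size_t)i * ncols,
               src + row_idx[i] * ncols,
               ncols * sizeof(double));
    }
}
```

With only 15 rows, at most 15 threads get work here, and the copy is probably limited by memory bandwidth rather than core count, which is part of why I'm unsure how much there is to gain.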
For manipulating: I have many different functions that manipulate arrays element by element or row by row. Ideally each of these would run on multiple cores.
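As an example of a row-wise manipulation, the sketch below subtracts each row's mean from that row, again assuming row-major doubles. The function `center_rows` is a hypothetical stand-in for my real functions, not one of them; the point is only that each row is independent, so the outer loop is what I would hope to run in parallel.

```c
#include <stddef.h>

/* Subtract the mean of each row from that row, in place.
 * a is a row-major nrows x ncols matrix of doubles. */
void center_rows(double *a, size_t nrows, size_t ncols)
{
    if (ncols == 0) return;  /* nothing to do, and avoids dividing by zero */

    /* Rows do not depend on each other, so each thread takes a block of rows. */
    #pragma omp parallel for schedule(static)
    for (long r = 0; r < (long)nrows; ++r) {
        double *row = a + (size_t)r * ncols;

        double sum = 0.0;
        for (size_t c = 0; c < ncols; ++c)
            sum += row[c];
        double mean = sum / (double)ncols;

        for (size_t c = 0; c < ncols; ++c)
            row[c] -= mean;
    }
}
```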
My question is: is it best to use the right framework (OpenMP, OpenCL) and let the compiler work its magic, or are there good functions/libraries that do this faster?