
I'm trying to optimize my code, taking advantage of multicore processors, to both copy and manipulate large dense arrays.

For copying: I have a large dense array (approximately 6000x100000) from which I need to pull 15x100000 subarrays to do several computations down the pipe. The pipe consists of a lot of linear algebra functions that are handled by BLAS, which is multithreaded. Whether or not the time to pull data will really matter compared to the linear algebra is an open question, but I'd like to err on the side of caution and make sure the data copying is optimized.
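To make the copying part concrete, here's a minimal sketch of what I mean, assuming the matrix is stored as row-major doubles (the names and sizes are placeholders, not my real code):

```cpp
#include <cstddef>
#include <cstring>

// Illustrative sizes only; the real matrix is ~6000x100000.
constexpr std::size_t COLS = 100000;
constexpr std::size_t BAND = 15;   // rows per subarray

// Copy rows [first, first + BAND) of a row-major matrix into out.
// In row-major storage those rows form one contiguous block, so a
// single memcpy suffices.
void copy_band(const double* matrix, std::size_t first, double* out) {
    std::memcpy(out, matrix + first * COLS, BAND * COLS * sizeof(double));
}
```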

For manipulating: I have many different functions that manipulate arrays element-wise or row-wise. It would be best if each of these ran on multiple cores.
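For example, the kind of loop I'd like parallelized looks roughly like this (the `std::sqrt` is just a stand-in for my real per-element operations; the pragma takes effect when compiled with OpenMP support, e.g. `-fopenmp`):

```cpp
#include <cmath>
#include <cstddef>

// Apply a per-element operation to a row-major matrix, splitting the
// rows across cores with OpenMP.
void apply_elementwise(double* data, std::size_t rows, std::size_t cols) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(rows); ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            data[i * cols + j] = std::sqrt(data[i * cols + j]);  // placeholder op
        }
    }
}
```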

My question is: is it best to use the right framework (OpenMP, OpenCL) and let all the magic happen with the compiler, or are there good functions/libraries that do this faster?

Deverp
  • One of my initial thoughts is that if your matrix is stored in row-major order, you could reference 15-row sections without doing any copying at all. – Vaughn Cato Dec 23 '12 at 18:35
  • Yes, but copying needs to be done eventually as I don't want to change the data. – Deverp Dec 23 '12 at 18:37
  • subarrays are 15x1000000, initial is 6000x100000. 1000000 vs 100000, where is the typo? – SergeyS Dec 23 '12 at 18:38
  • 1000000 is the typo. Should be 100000 – Deverp Dec 23 '12 at 19:18
  • For very large matrices, processing on the GPU is the way to go; check out the [ViennaCL](http://viennacl.sourceforge.net) project. – demorge Dec 23 '12 at 23:54

1 Answer


Your starting point should be good old `memcpy`. Some tips from someone who has long been obsessed with "copying performance":

  1. Read *What Every Programmer Should Know About Memory*.
  2. Benchmark your system's memcpy performance, e.g. the memcpy_bench function here (see the sketch after this list for the basic idea).
  3. Benchmark the scalability of memcpy when it's run on multiple cores, e.g. the multi_memcpy_bench here. (Unless you're on some multi-socket NUMA HW, I think you won't see much benefit to multithreaded copying.)
  4. Dig into your system's implementation of memcpy and understand it. The days when you'd find most of the time spent in a solitary `rep movsd` are long gone; last time I looked at gcc's and the Intel compiler's CRTs, they both varied their strategy depending on the size of the copy relative to the CPU's cache size.
  5. On Intel, understand the advantages of the non-cache-polluting store instructions (e.g. `movntps`), as these can achieve significant throughput improvements vs. a conventional approach (you'll see these used in step 4; a sketch also follows this list).
  6. Have access to, and know how to use, a sampling profiler to identify how much of your app's time is spent in copying operations. There are also more advanced tools which can look at CPU performance counters and tell you all sorts of things about what the various caches are doing etc.
  7. (Advanced topic) Be aware of the TLB and when huge pages can help (also sketched below).
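For steps 2 and 3, the benchmark needn't be anything fancy. This is not the code linked above, just a minimal sketch of the idea: give each thread its own disjoint slice and see whether extra cores actually add bandwidth:

```cpp
#include <chrono>
#include <cstring>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const std::size_t bytes = std::size_t(1) << 28;  // 256 MiB total
    // Constructing the vectors touches every page up front, so first-touch
    // page faults don't pollute the timings below.
    std::vector<char> src(bytes, 1), dst(bytes);

    for (unsigned nthreads = 1; nthreads <= 4; ++nthreads) {
        const std::size_t chunk = bytes / nthreads;  // remainder ignored
        const auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nthreads; ++t)
            pool.emplace_back([&, t] {
                std::memcpy(dst.data() + t * chunk, src.data() + t * chunk, chunk);
            });
        for (auto& th : pool) th.join();
        const std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - t0;
        std::cout << nthreads << " thread(s): "
                  << (bytes >> 20) / dt.count() << " MiB/s\n";
    }
}
```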
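For step 5, here's a sketch of a non-temporal copy using the SSE intrinsic form of `movntps` (`_mm_stream_ps`). It assumes 16-byte-aligned pointers and a float count that's a multiple of 16:

```cpp
#include <cstddef>
#include <xmmintrin.h>  // _mm_load_ps, _mm_stream_ps, _mm_sfence

// Copy floats without filling the cache with the destination data.
// Assumes src and dst are 16-byte aligned and count % 16 == 0.
void stream_copy(float* dst, const float* src, std::size_t count) {
    for (std::size_t i = 0; i < count; i += 16) {
        const __m128 a = _mm_load_ps(src + i);
        const __m128 b = _mm_load_ps(src + i + 4);
        const __m128 c = _mm_load_ps(src + i + 8);
        const __m128 d = _mm_load_ps(src + i + 12);
        _mm_stream_ps(dst + i,      a);  // movntps: bypasses the cache
        _mm_stream_ps(dst + i + 4,  b);
        _mm_stream_ps(dst + i + 8,  c);
        _mm_stream_ps(dst + i + 12, d);
    }
    _mm_sfence();  // make the streamed stores visible before returning
}
```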
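And for step 7, a Linux-specific sketch of one way to experiment with huge pages; `MAP_HUGETLB` requires huge pages to have been reserved by the administrator (e.g. via /proc/sys/vm/nr_hugepages):

```cpp
#include <cstddef>
#include <sys/mman.h>

// Linux-specific: request a huge-page-backed buffer to cut TLB misses
// on large copies. Returns nullptr if no huge pages are available.
void* alloc_huge(std::size_t len) {
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}
```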

But my expectation is that your copies will be pretty minor overhead compared with any linalg heavy lifting. It's good to be aware of what the numbers are, though. I wouldn't expect OpenCL or whatever on the CPU to magically offer any improvements here (unless your system's memcpy is poorly implemented); IMHO it's better to dig into this stuff in more detail, getting down to the basics of what's actually happening at the level of instructions, registers, cache lines and pages, than to move away from that by layering another level of abstraction on top.

Of course if you're considering porting your code from whatever multicore BLAS library you're using currently to a GPU accelerated linear algebra version, this becomes a completely different (and much more complicated) question (see JayC's comment below). If you want substantial performance gains you should certainly be considering it though.

timday
  • `memcpy` is type-unsafe and bad. Use `std::copy`. No, it has no runtime overhead compared to `memcpy`. –  Dec 24 '12 at 10:30
  • @Zoidberg: Well, the reason std::copy has zero overhead is that any decent compiler will replace it with a memcpy/memmove, at least for POD types (are there any modern mainstream compilers which don't do this? It wasn't always so), so understanding memcpy performance remains important however you get to it. But yes, using std::copy is good advice in such circumstances. See also e.g. this answer: http://stackoverflow.com/a/4707028/24283 – timday Dec 24 '12 at 11:18
  • I think this answer is excellent regarding CPU-only concerns. But if OpenCL (in general) is an option, as OP suggests, we also might need to understand the costs of memory transfer between main memory and GPU memory. I know OpenCL is intended so that code can run on a CPU or a GPU as long as there are OpenCL SDKs for either, so it's not necessary that OP would run code on the GPU should he/she choose OpenCL. I just have a feeling that this answer, while quite exhaustive and in far more depth than I ever dreamed of knowing, doesn't quite cover the costs of the whole set of options. – JayC Dec 24 '12 at 17:16