
I have a column vector A which is 10 elements long. I have a matrix B which is 10 by 10. The memory storage for B is column major. I would like to overwrite the first row in B with the column vector A.

Clearly, I can do:

for ( int i=0; i < 10; i++ )
{
    B[0 + 10 * i] = A[i];
}

where I've left the zero in 0 + 10 * i to highlight that B uses column-major storage (zero is the row-index).

After some shenanigans in CUDA-land tonight, I wondered whether there is a CPU function that performs a strided memcpy. I guess that, at a low level, performance would depend on the existence of a strided load/store instruction, which I don't recall existing in x86 assembly.

M. Tibbits

1 Answer


Short answer: The code you have written is as fast as it's going to get.

Long answer: The memcpy function is written using complicated intrinsics or assembly because it operates on memory operands of arbitrary size and alignment. If you are overwriting a row of a column-major matrix, as in your loop, then each operand is a single naturally aligned element, and you won't need to resort to the same tricks to get decent speed.
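
For reference, the loop from the question can be wrapped in a small helper. This is only a sketch: `strided_copy` and `set_row_colmajor` are hypothetical names, not standard library functions, and the element type `double` is an assumption because the question does not state one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the C standard library.
   Copies n contiguous doubles from src to dst[0], dst[stride],
   dst[2 * stride], ... (the element type is an assumption). */
void strided_copy(double *dst, size_t stride, const double *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i * stride] = src[i];
    }
}

/* Overwrites row `row` of a column-major matrix B (rows x cols)
   with the cols elements of A. Element (r, c) lives at B[r + rows * c],
   so consecutive row elements are `rows` apart in memory. */
void set_row_colmajor(double *B, size_t rows, size_t cols,
                      size_t row, const double *A)
{
    assert(row < rows);
    strided_copy(B + row, rows, A, cols);
}
```

With the 10 by 10 matrix from the question, `set_row_colmajor(B, 10, 10, 0, A)` performs the same stores as the original loop. A compiler is free to unroll the loop, but the work remains one scalar store per element.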

Dietrich Epp
  • I guess I just had hopes of assembly level access to say 'un'-strided load/store instructions for dual&triple channel memory. – M. Tibbits May 16 '11 at 06:42
  • I'm not sure what you mean by 'un-strided' load/store operations. – Dietrich Epp May 16 '11 at 06:49
  • Perhaps just an incorrect perception on my part, but I thought triple channel ram was striped through the address space? If I could write to just one of the memory chips (write only in one channel at a slower speed) that would be the equivalent of a strided memcpy? This would of course depend strongly on the granularity of the striping. – M. Tibbits May 16 '11 at 06:55
  • I'm no longer sure what you mean by 'strided memcpy'. I thought you meant copy from X,X+1,X+2... to Y,Y+N,Y+2*N,... This has little to do with the way RAM is organized. I suggest reading about how modern processors work, especially w.r.t. caching. – Dietrich Epp May 16 '11 at 07:04
  • Yes, that's exactly what I want: Y, Y+N, Y+2N, ... Your comment on caching made me realize that it's impractical to transfer to the system bus just to transpose -- sorry, a bit tired here. Clearly, the code in the question would stay within the L1 cache on my Core i7. – M. Tibbits May 16 '11 at 07:16
  • It's not clear that this is really the case; it is possible to move data faster depending on the alignment of the array. A good memcpy should check the alignment and take a different path when the alignment is optimal. There is a reasonable chance that the code your compiler emits will not perform these checks. There is a bit more to be said about the advantage of using `memcpy`. – Mikhail May 25 '13 at 17:26