How strided memcpy(3) works in libvpx

Question

I'm trying to understand the following function in libvpx (vp8/common/reconinter.c):

void vp8_copy_mem16x16_c(unsigned char *src, int src_stride, unsigned char *dst,
                         int dst_stride) {
  int r;

  for (r = 0; r < 16; ++r) {
    memcpy(dst, src, 16);

    src += src_stride;
    dst += dst_stride;
  }
}

(8x8 and 8x4 versions also exist in the same source file.)

It copies 16 bytes from src to dst 16 times, but at the same time it adds a custom stride to both src and dst. Without prior knowledge of computer graphics and DSP, I find these functions very confusing: what's the point of supporting custom strides in src and dst? What are some examples or benefits of using such functions rather than just copying the whole 16 x 16 bytes all at once?

Thank you very much!

Update: to make it clear, vp8_copy_mem16x16_c is re-defined as vp8_copy_mem16x16 at build time when a vector-optimized version is not available on the target platform.

https://github.com/search?q=repo%3Awebmproject%2Flibvpx%20vp8_copy_mem16x16&type=code There are some usages. I think the `_c` version in your question is confusing, you are not interested in understanding the function itself, and you are asking about _usage_ or purpose of the vp8_copy_mem16x16 as an operation. Is that right? `What are some examples or benefits of using such functions` The result is just different when src_stride != dst_stride. — KamilCuk, Aug 04 '23 at 14:24
But why do you think that the blocks are all continuous? That would only be true for stride==16. — HolyBlackCat, Aug 04 '23 at 14:29

score 2 · Accepted Answer · answered Aug 04 '23 at 14:31

Your question is what stride is for, if I'm understanding it correctly.

In the context of libvpx, there are two large use cases for it:

Encoding individual blocks of the source stream. If you have an entire image, you can pass the image's row stride (the image width plus any per-row padding) as the source stride and the block width (16 bytes here, or whatever your algorithm needs) as the destination stride to extract a block efficiently. Because the function does not advance the pointers inside a row, each stride is the full distance in bytes from the start of one row to the start of the next. Edit: to be clear, most video encoding and decoding operations work on square or rectangular blocks. JPEG is an example of this, and MPEG-family codecs and VP8/VP9 are also block-based. This is a very basic, very often used operation.
Alignment and padding. While most APIs allow images whose dimensions are not powers of two, efficient memory access, especially on the GPU, pretty much requires power-of-two sizes (or at least some alignment padding at the end of each row). The source and the destination can each have different such requirements, which is why there are two separate stride arguments.
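As a rough illustration of the block-extraction case, here is a minimal sketch. The names copy_mem16x16 and extract_block16 are hypothetical (the first simply restates the loop from the question), and the image is assumed to hold one byte per pixel with a row stride given in bytes.

```c
#include <assert.h>
#include <string.h>

/* Local restatement of the loop from the question (hypothetical name). */
static void copy_mem16x16(const unsigned char *src, int src_stride,
                          unsigned char *dst, int dst_stride) {
  for (int r = 0; r < 16; ++r) {
    memcpy(dst, src, 16);
    src += src_stride; /* jump to the start of the next source row */
    dst += dst_stride; /* jump to the start of the next destination row */
  }
}

/* Hypothetical helper: extract the 16x16 block whose top-left pixel is
   (x, y) from an 8-bit image into a tightly packed 256-byte buffer.
   `stride` is the allocated row size in bytes, assumed >= width. */
static void extract_block16(const unsigned char *image, int width, int height,
                            int stride, int x, int y,
                            unsigned char block[256]) {
  assert(x >= 0 && y >= 0 && x + 16 <= width && y + 16 <= height);
  assert(stride >= width);
  /* Source rows are `stride` bytes apart; packed rows are 16 bytes apart. */
  copy_mem16x16(image + y * stride + x, stride, block, 16);
}
```

Swapping the source and destination arguments of the copy would write a packed block back into the larger image, which is the same operation in the other direction.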

In general, however, there is a third use case for strides: sprite blitting. Similar to the first point above, you can very efficiently blit sprites to textures (and/or to the screen, if there's no double buffering) by using strides to copy memory.

score 2 · Answer 2 · answered Aug 04 '23 at 14:56

Consider two two-dimensional arrays with 16-byte elements, say M16 A[1024][1280] and M16 B[1024][1600], and suppose you want to copy a column from array B to array A, as in:

int AColumn = 37;
int BColumn = 46;
for (int i = 0; i < 1024; ++i)
    A[i][AColumn] = B[i][BColumn];

The elements of A this loop operates on are A[0][AColumn], A[1][AColumn], A[2][AColumn], and so on. Since the width of A is 1280 elements, the successive elements in the loop are 1280 elements apart in memory, and that is 1280•16 = 20,480 bytes.

Similarly, the successive elements of B in the loop are 1600 elements apart, and that is 1600•16 = 25,600 bytes.

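Thus, if we call vp8_copy_mem16x16_c with a src_stride of 25,600 and a dst_stride of 20,480, it can copy a column from B into a column of A. (For src, we pass the address of the first source element, &B[0][BColumn], and, for dst, we pass the address of the first destination element, &A[0][AColumn]. Each call copies 16 consecutive elements of the column, so covering all 1024 rows takes 64 calls.)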
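The column copy can be sketched as below. This is only an illustration under stated assumptions: the array dimensions are much smaller than in the text (ROWS, A_COLS and B_COLS are made-up values), M16 is assumed to be a plain 16-byte struct, and copy_mem16x16 and copy_column are hypothetical names, the first restating the loop from the question. B is the source and A the destination, as in the loop above.

```c
#include <assert.h>
#include <string.h>

/* Local restatement of the loop from the question (hypothetical name). */
static void copy_mem16x16(const unsigned char *src, int src_stride,
                          unsigned char *dst, int dst_stride) {
  for (int r = 0; r < 16; ++r) {
    memcpy(dst, src, 16);
    src += src_stride;
    dst += dst_stride;
  }
}

/* Assumed 16-byte element type; the answer does not define M16. */
typedef struct { unsigned char bytes[16]; } M16;

/* Illustrative sizes only; ROWS must be a multiple of 16. */
enum { ROWS = 32, A_COLS = 20, B_COLS = 25 };

/* Copy column b_col of B into column a_col of A, 16 rows per call. */
static void copy_column(M16 A[ROWS][A_COLS], int a_col,
                        M16 B[ROWS][B_COLS], int b_col) {
  assert(sizeof(M16) == 16);
  assert(a_col >= 0 && a_col < A_COLS && b_col >= 0 && b_col < B_COLS);
  for (int row = 0; row < ROWS; row += 16) {
    /* The stride is the size in bytes of one full row of each array. */
    copy_mem16x16((const unsigned char *)&B[row][b_col], (int)sizeof B[0],
                  (unsigned char *)&A[row][a_col], (int)sizeof A[0]);
  }
}
```

With these sizes the strides are 25 * 16 = 400 bytes for B and 20 * 16 = 320 bytes for A, the same arithmetic as the 25,600 and 20,480 figures in the text.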

Different selections of strides could copy a column of one array into a row of another, or vice-versa. vp8_copy_mem16x16_c is a generalized “Copy 16-byte chunks at some regular spacing in memory to destinations at some regular spacing in memory” that can operate on rows, columns, alternating elements (such as every second element of a column), and other arrangements.

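For another example, consider struct { M16 m; RGB p; int i; } B[1024]; and M16 A[1024]. We could extract the M16 members of the first 16 structures in B to the homogeneous M16 array A with vp8_copy_mem16x16_c((unsigned char *)&B[0].m, sizeof *B, (unsigned char *)A, sizeof *A); (source and its stride first, then destination and its stride), and repeat the call for each further group of 16.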
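A minimal sketch of this member extraction follows. The layouts of M16 and RGB are assumptions made for the example, and Record, copy_mem16x16 and extract_m16 are hypothetical names. The source stride is the size of the whole structure, while the destination stride is the size of one M16.

```c
#include <assert.h>
#include <string.h>

/* Local restatement of the loop from the question (hypothetical name). */
static void copy_mem16x16(const unsigned char *src, int src_stride,
                          unsigned char *dst, int dst_stride) {
  for (int r = 0; r < 16; ++r) {
    memcpy(dst, src, 16);
    src += src_stride;
    dst += dst_stride;
  }
}

/* Assumed layouts; the answer does not define M16 or RGB. */
typedef struct { unsigned char bytes[16]; } M16;
typedef struct { unsigned char r, g, b; } RGB;
typedef struct { M16 m; RGB p; int i; } Record;

/* Gather the m member of `count` records into a packed M16 array.
   `count` must be a multiple of 16 because each call moves 16 elements. */
static void extract_m16(const Record *records, M16 *out, int count) {
  assert(sizeof(M16) == 16);
  assert(count >= 0 && count % 16 == 0);
  for (int k = 0; k < count; k += 16) {
    copy_mem16x16((const unsigned char *)&records[k].m, (int)sizeof *records,
                  (unsigned char *)&out[k], (int)sizeof *out);
  }
}
```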

user3528438 · Answer 3 · 2023-08-04T14:47:00.913

This is trying to copy a 16x16 square block between two images (i.e. 2D arrays).

The intended usage is to set src and dst to the beginning position of the source and destination blocks and set each stride to the width of the entire image.

This function also provides two separate strides for src and dst so that the source and destination images do not have to be the same width.
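A small sketch of copying a block between two images whose rows have different allocated sizes is shown below. The names copy_mem16x16, aligned_stride and copy_block are hypothetical, pixels are assumed to be one byte each, and rounding each row up to a power-of-two alignment is just one possible padding policy.

```c
#include <assert.h>
#include <string.h>

/* Local restatement of the loop from the question (hypothetical name). */
static void copy_mem16x16(const unsigned char *src, int src_stride,
                          unsigned char *dst, int dst_stride) {
  for (int r = 0; r < 16; ++r) {
    memcpy(dst, src, 16);
    src += src_stride;
    dst += dst_stride;
  }
}

/* Round a visible width up to a multiple of `align` (a power of two),
   giving an allocated row size in bytes. */
static int aligned_stride(int width, int align) {
  assert(width >= 0 && align > 0 && (align & (align - 1)) == 0);
  return (width + align - 1) & ~(align - 1);
}

/* Copy the 16x16 block at (sx, sy) in the source image to (dx, dy) in the
   destination image; each image carries its own stride. The caller must
   ensure both blocks lie fully inside their images. */
static void copy_block(const unsigned char *src_img, int src_stride,
                       int sx, int sy,
                       unsigned char *dst_img, int dst_stride,
                       int dx, int dy) {
  copy_mem16x16(src_img + sy * src_stride + sx, src_stride,
                dst_img + dy * dst_stride + dx, dst_stride);
}
```

For instance, an image that is 100 pixels wide with rows padded to 32 bytes would use a stride of 128, while a 64-pixel-wide destination would use 64, and the same call handles both.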

Note

"Width" should really be "stride" here because "width" is the valid/visible size of each scanline but "stride" is the allocated size of the scanline. From a memory point of view, it's the stride that matters here, not width.
