
I'm trying to copy 1 or 2 colour channels from RGBA image data as quickly as possible (this is the slowest part of my code, and it's slowing the whole app down). Is there a fast way of copying with stride?

The data is simply laid out as RGBARGBARGBA etc., and I need to copy just the R values, or in another case just the RG values.

What I have so far is roughly this to copy the R values:

for(int i=0; i<dataSize; i++){
    dest[i] = source[i*4];
}

For the RG values, I'm doing:

for(int i=0; i<dataSize; i+=2){
    dest[i] = source[i*2];
    dest[i+1] = source[(i*2)+1];
}

All the data is unsigned 1-byte values. Is there a faster way? I've already partially unrolled the loop (doing 64 values per iteration; no significant speedup beyond that). The platform is ARMv7 (iOS), so using NEON (SIMD) might be useful; I've zero experience with that unfortunately!

Changing the data is unfortunately out of the question: it's provided by OpenGL's glReadPixels() function, and iOS doesn't support reading as L, LA, RG etc. so far as I've been able to tell.

  • For RG values `*(uint16_t *)(dest + i) = *(short *)(source + i)` might help. – Chris Lutz Jun 27 '11 at 08:16
  • Also, what's with those dollar signs? – Chris Lutz Jun 27 '11 at 08:17
  • Yes, that might indeed help. I'll give that a go, and profile it - it might just make the difference (I'm at 22fps, and need 25, so even a small difference is enough). And the dollar signs.. what the hell?! Lack of sleep? :D I'll go make a quick edit before anyone notices –  Jun 27 '11 at 08:55
  • @Chris Lutz, I think there could be a typing mistake by @psonic: an & sign instead of $. – Tirth Jun 27 '11 at 09:05
  • Stupid questions -- I assume you've already eliminated these possibilities, but is it possible to use OpenGL functions to flatten the data to monochrome or something before doing glReadPixels? Or to alter the video encoding to expect the data in stride format and eliminate the redundant copy? – Jack V. Jun 27 '11 at 11:02
  • @jack V. Yes, and no. I can use plain RGBA data in opengl, and send RGBA data to the video encoder. Problem is the processing is so complex in RGBA that the app ends up incredibly slow. Because of that I'm using YUV data (with UV being 1/2 resolution, hence the two unflattened textures, one with just Y and the other with UV). Flattening on the GPU is possible, except that the GPU is already right at the limit, and there's nothing much left to optimise there. –  Jun 27 '11 at 11:29
  • BTW, modern processors have a Zero Overhead Loop (ZOL) mechanism, meaning that other than the first setup cycle, the test and branch are done in hardware, hence no penalty. This is why you saw negligible improvement when unrolling the loop. Unrolling is a useful practice, though, but for other purposes (i.e., not to save loop iterations). – ysap Jul 01 '11 at 00:41
  • What you want is "extract one plane from interleaved image"; perhaps searching for that helps. "Stride", while not wrong, is usually used for _lines_ rather than pixels. – Pablo H Jul 13 '23 at 15:42
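The 16-bit copy suggested in the comments can be sketched in portable C like this (a sketch only; `memcpy` sidesteps the alignment and strict-aliasing concerns of the raw pointer casts, and compilers turn it into single 16-bit loads/stores):

```c
#include <stdint.h>
#include <string.h>

/* Copy the R,G pair of each RGBA pixel as one 16-bit unit instead of
   two separate byte copies. */
static void copy_rg_u16(uint8_t *dest, const uint8_t *source, int pixels)
{
    for (int i = 0; i < pixels; i++) {
        uint16_t rg;
        memcpy(&rg, source + (size_t)i * 4, sizeof rg); /* R,G of pixel i */
        memcpy(dest + (size_t)i * 2, &rg, sizeof rg);
    }
}
```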

8 Answers

5

If you're OK with iOS4 and above, you might find vDSP and the accelerate framework useful. Check out the documentation for all sorts of image manipulation goodness at warp speed.

#import <Accelerate/Accelerate.h>

I don't know what you do next, but if you're doing any form of calculation on the image data and want it in floating point form, you can use vDSP_vfltu8 to convert one channel of the source byte data to single-precision floating point in a single line like this (excluding the memory management):

vDSP_vfltu8(srcData+0, 4, destinationAsFloatRed, 1, numberOfPixels);
vDSP_vfltu8(srcData+1, 4, destinationAsFloatGreen, 1, numberOfPixels);
vDSP_vfltu8(srcData+2, 4, destinationAsFloatBlue, 1, numberOfPixels);
vDSP_vfltu8(srcData+3, 4, destinationAsFloatAlpha, 1, numberOfPixels);
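For reference, here is what those stride arguments mean, written out as a plain-C model of vDSP_vfltu8 (a sketch of its documented behaviour, not Apple's implementation):

```c
#include <stdint.h>

/* Model of vDSP_vfltu8(A, IA, C, IC, N): convert N unsigned bytes, read
   with stride IA, to floats written with stride IC. */
static void vfltu8_model(const uint8_t *A, long IA, float *C, long IC,
                         unsigned long N)
{
    for (unsigned long n = 0; n < N; n++)
        C[n * IC] = (float)A[n * IA];
}
```

So a source stride of 4 walks one channel of interleaved RGBA, and a destination stride of 1 packs it densely.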

If you then need to create an image from the manipulated floating point data, use vDSP_vfixu8 to go back the other way:

vDSP_vfixu8(destinationAsFloatRed,1,outputData+0,4,numberOfPixels);
vDSP_vfixu8(destinationAsFloatGreen,1,outputData+1,4,numberOfPixels);
vDSP_vfixu8(destinationAsFloatBlue,1,outputData+2,4,numberOfPixels);
vDSP_vfixu8(destinationAsFloatAlpha,1,outputData+3,4,numberOfPixels);

Obviously you can just process 1 or 2 channels using the above technique.

The documentation is quite complex, but the results are good.

Roger
  • I'm doing all the heavy lifting on the GPU with GLSL, and have already optimised that side to the bone. The 'slow' bit is simply getting the data back from the texture, and dropping the unwanted channels because readPixels() only supports RGBA. However, I think vDSP could still be useful, because there are a few gather functions. I'd left this to one side, after taking a quick look at the docs (like you say, it's a bit complex!) but seeing your code there, maybe it's not so bad as I thought. I'll give it a go. –  Jun 27 '11 at 09:09
  • Do you just need to copy the data, or do you do something with it after the copy? – Roger Jun 27 '11 at 09:22
  • Just a straight RGBA -> R (or RG) copy. The processing is all done, I just need to get the data in the right format for video encoding. –  Jun 27 '11 at 09:50
  • Hmmm. In that case I'm less sure vDSP etc will help, it scores when you need some processing as well because it can do the int to float conversions really fast both ways, but in your case that conversion will just hurt performance and one of the other answers will give better results. I have a feeling that strided memcpy type operations will basically hurt no matter how you do them. It's an interesting problem and I'll mull it over a bit more. – Roger Jun 27 '11 at 10:02
  • Ah. Yes. I was thinking of vDSP_vgathr ( http://developer.apple.com/library/ios/#documentation/Accelerate/Reference/vDSPRef/Reference/reference.html#//apple_ref/doc/uid/TP40009464 ) but that's going to operate on 32bit values, which is useless in this case. –  Jun 27 '11 at 10:14
3

As always, loads and stores are the most expensive operations. You could optimize your code in the following fashion:

  • Load one int (one RGBA pixel).
  • Mask the required part into a register (temp variable).
  • Shift the data to the right place in the temp variable.
  • Repeat until the native processor word is full (4 times for chars on a 32-bit machine).
  • Store the temp variable to memory.

The code below is just typed quickly to get the idea across.

unsigned int tmp;
unsigned int *dst = (unsigned int *)dest;  /* dest must be 4-byte aligned */

/* dataSize = number of R bytes to produce; little-endian byte order assumed */
for(int i=0; i<dataSize; i+=4){
    tmp  = source[i*4];                           /* R of pixel i   */
    tmp |= (unsigned int)source[i*4 + 4]  << 8;   /* R of pixel i+1 */
    tmp |= (unsigned int)source[i*4 + 8]  << 16;  /* R of pixel i+2 */
    tmp |= (unsigned int)source[i*4 + 12] << 24;  /* R of pixel i+3 */

    *dst++ = tmp;                                 /* one store per four pixels */
}
wpaulus
  • I think you're right, especially as this kind of processing is not cache friendly. Best case, I could go from 4 stores to 1, and perhaps 2 loads to 1, at the expense of a few extra operations. That could well make it fast enough! –  Jun 27 '11 at 12:16
  • As an additional comment: make sure your data is int-aligned, as the ARM cannot read unaligned data. – wpaulus Jun 27 '11 at 12:26
2

Depending on the compiled code, you may want to replace the multiplication by 2 with addition of a second loop index (call it j and advance it by 4 each iteration):

for(int i=0, j=0; i<dataSize; i+=2, j+=4){
    dest[i] = source[j];
    dest[i+1] = source[j+1];
}

Alternatively, you can replace the multiplication with a shift by 1:

for(int i=0; i<dataSize; i+=2){
    dest[i] = source[i<<1];
    dest[i+1] = source[(i<<1)+1];
}
ysap
  • I haven't checked what the compiler emits for this (and I'm not knowledgeable enough about ARM assembler) but in general, multiplication is very expensive. Bit shifting is a valid optimization here. I would give it a try (though the above code is not perfect). – Johannes Rudolph Jun 27 '11 at 08:33
  • IMHO replacing a multiplication with a shift is bad advice; this is the compiler's concern. – duedl0r Jun 27 '11 at 08:34
  • Helpful. What's normally the compiler's concern is absolutely my concern just now, so I'll try both and profile. Even a small difference might be enough. –  Jun 27 '11 at 09:02
  • Actually, for any quarter-decent processor today, multiplication is actually a native instruction, and usually can be done in the same number of cycles as addition. – ysap Jun 27 '11 at 13:26
  • @duedl0r - You, my friend, are wrong here. Although very advanced, compilers have a limited set of heuristics. Usually, in order to take advantage of an architecture's advantages and strength points, one needs to write his code in a *specific* way, so as to *hint* the compiler on how to produce optimal code. – ysap Jun 27 '11 at 13:29
  • @ysap I don't really believe that a multiplication can be done as fast as an addition. Even if it's a native instruction, that doesn't mean all the instructions take the same number of cycles. If that was the case, why do you suggest to optimize the multiplcations and use addition? – duedl0r Jun 27 '11 at 13:50
  • @ysap And I also disagree with your compiler optimization comment. Compilers aren't stupid anymore.. :) – duedl0r Jun 27 '11 at 14:01
  • @duedl0r - first, it is not a matter of belief. It is a fact. Today's processors are capable of performing multiplies in the same latency as addition. For example - ADI's Blackfin (which I mentioned in another answer here) can do two MAC (multiply accumulate) operations per cycle. To absolutely convince you - I was personally the designer (the one who designed the electronics) of the Multiplier unit of ADI's TigerSHARC processor. It was able to perform up to 8 MACs 16-bit per cycle, or two 32-bit MACs, or two floating point multiplies per cycle! – ysap Jun 27 '11 at 14:11
  • @duedl0r - I suggested the optimization b/c I am not familiar with the ARM architecture used by the OP's hardware. It *might* be that there the multiply is costlier indeed. – ysap Jun 27 '11 at 14:13
  • @duedl0r - as for the compilers - I never said they are stupid (in fact, it surprised me from time to time to see the code they produce). What I said is that for a given architecture, some specific strengths need to be exploited to gain maximum performance. This not always easily mapped from generic C code. However, compilers have functionality of recognizing specific code constructs and compile them to the most optimal code for that arch. This is very arch. specific, though! – ysap Jun 27 '11 at 14:16
  • @ysap Fine, and your cycle took 1 second? :) You just can't convince me that a multiplication uses less hardware or takes less time if you have an optimized addition. Maybe in your hardware where you have a slow addition. Generally, it's a design decision of the hardware developers, it might be true with your ADI's Blackfin, but it sounds rather strange for well known cpus like pentium. – duedl0r Jun 27 '11 at 14:30
  • @duedl0r -LOL, "Maybe in your hardware"... At the time of its design, TigerSHARC was the Rolls-Royce of DSPs... I don't know about Pentium, I am a DSP engineer. Intel may have made its own choices (and, BTW, you just named a reason to replace the mul with a shift on an Intel architecture). Instead of arguing, browse Texas Instruments (www.ti.com), Analog Devices (www.analog.com), ARM (www.arm.com) and any other embedded processing company and see for yourself, if you know how to read a hardware reference manual. – ysap Jun 27 '11 at 14:39
  • @duedl0r - and to make it clear - you **are right** in that mul takes more hardware/time than addition. The reason that Pentium got as fast as 4GHz years ago while leading DSPs just entered the realm of 1GHz is the Pentium's extremely deep pipeline - which enabled the top speeds - in which they might have decided to implement different latencies for different instructions. – ysap Jun 27 '11 at 14:41
  • @ysap hehe, I guessed you were a DSP engineer :) all you guys want to implement your code in asm. I was a compiler engineer, I do the asm thing one time in the compiler, and then write clean code after that :) nice talking to you though ;) – duedl0r Jun 27 '11 at 14:57
2

I'm more of a while guy -- you can convert it to for, I'm sure

i = j = 0;
while (dataSize--) {     /* dataSize counted in RGBA pixels here */
    dst[i++] = src[j++]; /* R */
    dst[i++] = src[j++]; /* G */
    j += 2;              /* skip B and A */
}

As for it being faster, you have to measure.

pmg
  • Thanks, will try it and profile (probably combined with some of the other suggestions). –  Jun 27 '11 at 09:52
1

Is your question still open? I published my ASM-accelerated function for copying bytes with stride a few days ago. It is about twice as fast as the corresponding C code. You can find it here: https://github.com/noveogroup/ios-aux It can be modified to copy 16-bit words for the RG case.

UPD: I have discovered that my solution is faster than the C code only in debug mode, where the compiler's optimization is switched off by default. In release mode the C code is optimized (by default) and runs as fast as my ASM code.

  • No, I managed to optimise the whole memory copy out completely in the end (always the fastest solution!) But I'll bookmark that, it'll be useful if I hit this again (quite possible). –  Sep 05 '13 at 14:49
1

Hope I'm not too late to the party! I just accomplished something similar on the iPad using ARM NEON intrinsics. I get a 2-3x speed-up compared to the other listed answers. Note that the code below keeps only the first channel and requires dataSize (the source length in bytes) to be a multiple of 32.

uint32x4_t mask = vdupq_n_u32(0xFF);

for (unsigned int i=0, j=0; i < dataSize; i+=32, j+=8) {

    // Load eight 4-byte integers from the source
    uint32x4_t vec0 = vld1q_u32((const unsigned int *) &source[i]);
    uint32x4_t vec1 = vld1q_u32((const unsigned int *) &source[i+16]);

    // Zero everything but the first byte in each of the eight integers
    vec0 = vandq_u32(vec0, mask);
    vec1 = vandq_u32(vec1, mask);

    // Throw away two bytes for each of the original integers
    uint16x4_t vec0_s = vmovn_u32(vec0);
    uint16x4_t vec1_s = vmovn_u32(vec1);

    // Combine the remaining bytes into a single vector
    uint16x8_t vec01_s = vcombine_u16(vec0_s, vec1_s);

    // Throw away the last byte for each of the original integers
    uint8x8_t vec_o = vmovn_u16(vec01_s);

    // Store to destination
    vst1_u8(&dest[j], vec_o);
}
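A plain-C scalar equivalent can be handy for verifying the NEON path on small buffers (a sketch; same dataSize-in-source-bytes convention, but without the multiple-of-32 restriction):

```c
#include <stdint.h>

/* Scalar reference for the intrinsics loop above: keep the first byte of
   every 4-byte RGBA pixel. */
static void extract_r_scalar(uint8_t *dest, const uint8_t *source,
                             unsigned int dataSize)
{
    for (unsigned int i = 0, j = 0; i < dataSize; i += 4, j++)
        dest[j] = source[i];
}
```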
amarcus
1

The answer from Roger is probably the cleanest solution. It's always good to have a library to keep your code small. But if you only want to optimize C code you can try different things. First you should analyze how big your dataSize is. Then you can do heavy loop unrolling, probably combined with copying ints instead of bytes: (pseudo code)

while(dataSize-i > n) { // n being 10 or whatever
   *(int*)(dest+i) = *(int*)(src+i); i++; // or i+=4; depending what you copy
   *(int*)(dest+i) = *(int*)(src+i);
   ... n times
}

and then do the rest with:

switch(dataSize-i) {
    case n-1: *(dest+i) = *(src+i); i++; // fall through
    case n-2: ...
    case 1: ...
}

it gets a bit ugly.. but it sure is fast :)

you can optimize even more if you know how dataSize behaves. Maybe it's always a power of 2? Or an even number?


I just realized that you can't copy 4 bytes at once :) but only 2 bytes. Anyway, I just wanted to show you how to end an unrolled loop with a switch statement with only 1 comparison. IMO the only way to get a decent speedup.
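Putting the two pieces together, here is a runnable sketch of the unroll-plus-switch pattern (unrolled by 4 and copying bytes for brevity; a real version would move words, and the function name is illustrative):

```c
#include <stddef.h>

/* Unrolled copy with a switch for the leftovers. The switch cases fall
   through deliberately, so the tail needs no loop and only one compare. */
static void copy_unrolled4(unsigned char *dest, const unsigned char *src,
                           size_t dataSize)
{
    size_t i = 0;
    while (dataSize - i >= 4) {       /* main unrolled loop */
        dest[i] = src[i]; i++;
        dest[i] = src[i]; i++;
        dest[i] = src[i]; i++;
        dest[i] = src[i]; i++;
    }
    switch (dataSize - i) {           /* 0..3 leftovers */
    case 3: dest[i] = src[i]; i++;    /* fall through */
    case 2: dest[i] = src[i]; i++;    /* fall through */
    case 1: dest[i] = src[i]; i++;
    }
}
```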

duedl0r
  • Actually, you *can* copy more bytes if you're willing to shift, but it probably wouldn't help. Off the top of my head: `(*((short*)dst)++) = (0xFFFF0000 & (*((unsigned*)src)++)) >> 16;` – Jon Purdy Jun 27 '11 at 10:34
  • Is the switch function to help with the "leftovers" if the data length doesn't divide into the loop size? If so, it's not needed here, but that's useful to know anyway. The data size for this is fixed (there's a few texture sizes, but I know them all in advance). Not power of 2 unfortunately, but they're all "convenient" numbers that divide by 1024. I'm splitting the work into 16 blocks, and running them concurrently (it's for ipad2, so dual core), then unrolling in batches of 64. –  Jun 27 '11 at 12:39
  • @psonice Yes, it's only useful for the leftovers. – duedl0r Jun 27 '11 at 13:58
0

Are you comfortable with ASM? I am not familiar with ARM processors, but on Analog Devices' Blackfin, this copy is actually FREE, since it can be done in parallel with a compute operation:

i0 = _src_addr;
i1 = _dest_addr;
p0 = dataSize - 1;

r0 = [i0++];
loop _mycopy lc0 = p0;
loop_begin _mycopy;
    /* possibly do compute work here | */ r0 = [i0++] | W [i1++] = r0.l;
loop_end _mycopy;
W [i1++] = r0.l;

So, you have 1 cycle per pixel. Note that as-is, this is good for RG or BA copy. As I said, I am not familiar with ARM and absolutely know nothing about iOS so I am not sure you even have access to ASM code, but you can try looking for that kind of optimizations.
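For readers without Blackfin hardware, the same idea in C: load each RGBA pixel as one 32-bit word and store only its low halfword, mirroring the `W [i1++] = r0.l` store above. On a little-endian layout the low halfword is the R and G bytes (an assumption of this sketch; names are illustrative):

```c
#include <stdint.h>
#include <string.h>

/* One 32-bit load per pixel, one 16-bit store of the low halfword
   (R,G on a little-endian machine). */
static void copy_rg_words(uint8_t *dest, const uint8_t *source, int pixels)
{
    for (int i = 0; i < pixels; i++) {
        uint32_t px;
        memcpy(&px, source + (size_t)i * 4, sizeof px);
        uint16_t rg = (uint16_t)(px & 0xFFFFu);
        memcpy(dest + (size_t)i * 2, &rg, sizeof rg);
    }
}
```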

ysap
  • Well, I last did ASM around 1993 on a 6502, so "comfortable" - no. That said, I only had to look W up, so perhaps I could use this (it'll be last resort though as it's way out of my comfort zone). There's no compute work to do here unfortunately, except any arithmetic to do with the copy addresses. –  Jun 27 '11 at 15:05