Faster method to copy arrays after typecasting in C?

Question

I have a two-dimesional integer array InArray[2][60] carrying short data in 2 LS bytes and bit field data in 2 MS bytes. Please suggest a faster method to extract short data and copy it to a short OutArray[60], something on the lines for memcpy(). I presume iterating through each item is not the most optimal method of doing this. TIA

EDIT : Adding code snippet

int InArray[2][60];
short OutArray[60];
for (int i=0; i < 60;i++)
{
    OutArray[i] = (short)(InArray[0][i] & 0xffff);
}

Is there a better and possibly faster way of doing this

Did you test with iterating through the array and find that its performance wasn't good enough? — Eric Finn, Jun 20 '13 at 11:56
is that `short InArray[2][60]` or `int InArray[2][60]`. If the latter, there's something else you aren't telling us because you have twice as much data as required. — Tom Tanner, Jun 20 '13 at 12:19
@EricFinn : Not certain, if the performance is good, but I assume in principle, serial data read should always be faster than iteraring element by element ? — Bleamer, Jun 20 '13 at 12:57
@TomTanner The original array is `int InArray[2][60]`, I will be reading the original array in the form of two one-dimensional arrays, over two iterations. — Bleamer, Jun 20 '13 at 13:00
What's LS byte, bit field data, MS byte. Also, I suggest you put here the "inefficient" algorithm, so that we understand exactly what you are up to. — Antonio, Jun 20 '13 at 13:00
@Antonio bit fields are something you can do in C/C++. I suggest googling that. LS and MS generally mean Least Significant and Most Significant. — Eric Finn, Jun 20 '13 at 13:02
@Bleamer What makes you think a `memcpy()` doesn't use a loop itself? Its implementers can't assume that whatever block of data they're being asked to copy will fit in a register. — Eric Finn, Jun 20 '13 at 13:07
@Bleamer So if I understand right, in each `int` in `InArray`, you have stored some data in the two most significant bytes, and other (un?)related data, a `short` (`unsigned short`?) in the two least significant bytes? — Daniel Fischer, Jun 20 '13 at 13:09
@DanielFischer Right, thats the way i would like to understand that. — Bleamer, Jun 20 '13 at 13:11
@EricFinn : Agreed, I am not aware of the internals, a solution to this problem might broaden my understanding. — Bleamer, Jun 20 '13 at 13:13
There isn't a quicker way, IMO. `memcpy()` won't work because you only want half of each `int`, not a contiguous block. I don't know of any solution other than iterating, but if you don't have anything profiled that shows this is a performance bottleneck it sounds like premature optimization to me. — Ken White, Jun 20 '13 at 13:17
I don't think you can do much better than what you have. It _might_ be a wee bit faster to combine two or four `short`s in registers and then write a four or eight-byte quantity at a time, but that may also be significantly slower. — Daniel Fischer, Jun 20 '13 at 13:23
@Bleamer You can look up implementations of memcpy. Here are a few: http://research.microsoft.com/en-us/um/redmond/projects/invisible/src/crt/memcpy.c.htm http://ctan.mirror.ac.za/macros/texinfo/contrib/texinfo-hu/texinfo/lib/memcpy.c Basically, it's implemented with a loop. So a loop of yours will probably do about as well. — Eric Finn, Jun 20 '13 at 13:32
@EricFinn : Thank you that was enlightening, I always thought memcpy() must have been optimized using some SIMD instructions, which does not appear to be the case. — Bleamer, Jun 22 '13 at 02:54
@Bleamer It's possible that a compiler's implementation for a certain target could do that; If you want to be sure, check your compiler's implementation. But if you don't know the target platform, you can't guarantee that it will have SIMD instructions. — Eric Finn, Jun 22 '13 at 04:43

score 2 · Answer 1 · answered Jun 20 '13 at 13:45

2

If you really are copying a 60-element array, then it does not matter.

If the array is larger and/or you are doing it a lot of times, then you'll want to have a look at SIMD instruction sets: SSEx on Intel platforms, Altivec on PPC...

For instance, using SSE4, you may use _mm_packus_epi32() which packs (and saturates) 2*4 32-bit operands into 8 16-bit operands.

Your compiler probably has intrinsics to use those: http://msdn.microsoft.com/en-us/library/hh977022.aspx, http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/PowerPC-AltiVec-Built_002din-Functions.html...

answered Jun 20 '13 at 13:45

Raphaël Saint-Pierre

2,498
1
19
23

Being a novice I would appreciate if you can direct me to similar resources related to ARM targets. Thank you. – Bleamer Jun 22 '13 at 02:51
@Bleamer, which compiler are you using ? – Raphaël Saint-Pierre Jun 28 '13 at 19:43
@Bleamer see the GCC docs: http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html for a reference. Performance does not look good, though, according to the accepted answer to this question: http://stackoverflow.com/questions/9828567/arm-neon-intrinsics-vs-hand-assembly. – Raphaël Saint-Pierre Jul 16 '13 at 22:47

score 1 · Answer 2 · 2013-06-20T15:14:20.447

This is only going to help if you're doing something like this many times. I used Agner Fog's vectorclass to do this (http://www.agner.org/optimize/vectorclass.zip). This is a class to use SSE/AVX. But you'll find the best answer if you add the tags SSE and AVX to your question.

You'll also get better results if you can insure the arrays are 16 byte or 32 byte aligned. In the code below it would also help to make either the width of the arrays equal to 64 (even if you are only going to use 60 elements) or to make the length of the array a multiple of 64.

#include <stdio.h>
#include "vectorclass.h"

void foo(int InArray[2][60],  short OutArray[60]) {
    for (int i=0; i < 60; i++) {
        OutArray[i] = (short)(InArray[0][i] & 0xffff);
    }
}

void foo_vec8s(int InArray[2][60],  short OutArray[60]) {
    int i=0;
    for (; i <(60-8); i+=8) {
        Vec8s v1 = Vec8s().load(&InArray[0][i]);
        Vec8s v2 = Vec8s().load(&InArray[0][i+4]);
        Vec8s out = blend8s<0,2,4,6,8,10,12,14>(v1,v2);
        out.store(&OutArray[i]);
    }
    //clean up since arrays are not a multiple of 64
    for (;i < 60; i++) {
        OutArray[i] = (short)(InArray[0][i] & 0xffff);
    }
}

int main() {
    int InArray[2][60];
    for(int i=0; i<60; i++) { 
        InArray[0][i] = i | 0xffff0000;
    }

    short OutArray1[60] = {0};
    foo(InArray, OutArray1);
    for(int i=0; i<60; i++) {
        printf("%d ", OutArray1[i]);
    } printf("\n");

    short OutArray2[60] = {0};
    foo_vec8s(InArray, OutArray2);
    for(int i=0; i<60; i++) {
        printf("%d ", OutArray2[i]);
    } printf("\n");  
}

Appreciate your response, though I should have been more descriptive in my query, I was seeking a solution for ARM target(vector class appears to be optimized for x86), you have any resources for that ? — Bleamer, Jun 22 '13 at 02:49
SIMD on ARM is called NEON (and NEONv2). It's 4-wide like SSE. It also has FMA (or at least some do). That's all I know. You could probably use similar logic for ARM. I mean make your own vectorclass called Vec8s which runs on eight shorts at once (four ints). I don't know if NEON can do that (efficiently). — , Jun 22 '13 at 05:58

Faster method to copy arrays after typecasting in C?

2 Answers2

Linked