The code below generates the following output:
6 6 0 140021597270387
so only the first two elements are computed correctly. However, I am working with longs (4 bytes each, as far as I know), and a 128-bit __m128i should be able to hold 4 of them.
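    #include <stdio.h>
    #include <stdlib.h>
    #include <emmintrin.h>  /* SSE2: __m128i, _mm_mul_epu32 */

    int main(void) {
        int i;
        long* AA = (long*)malloc(32 * sizeof(long));
        long* BB = (long*)malloc(32 * sizeof(long));
        for (i = 0; i < 4; i++) {
            AA[i] = 2;
            BB[i] = 3;
        }
        /* Reinterpret the first 16 bytes of each array as one SSE register */
        __m128i* m1 = (__m128i*)AA;
        __m128i* m2 = (__m128i*)BB;
        __m128i m3 = _mm_mul_epu32(m1[0], m2[0]);
        long* CC = (long*)malloc(16 * sizeof(long));
        CC = (long*)&m3;  /* view the result register as an array of longs */
        for (i = 0; i < 4; i++)
            printf("%ld \n", CC[i]);
        return 0;
    }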
I also tried allocating aligned memory instead:
    long* AA = (long*) memalign(16 * sizeof(long), 16);
(and likewise for the other arrays), but that causes a segmentation fault. Can somebody explain what is going wrong in both cases?
Thanks
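
For anyone trying to reproduce this, here is a minimal sketch of the multiplication using fixed-width types, so the element size does not depend on what `long` is on a given platform. It assumes an x86 target with SSE2 and a C99 compiler. The helper names `mul_two_lanes` and `mul_array` are hypothetical and not part of the code above. The sketch relies on `_mm_mul_epu32` reading only the low 32 bits of each 64-bit half and returning two 64-bit products, and it uses unaligned loads and stores so that no special allocation is needed.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_mul_epu32, _mm_loadu_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post).
   _mm_mul_epu32 multiplies the low 32 bits of each 64-bit half of its
   operands and yields two full 64-bit products, so one __m128i gives
   exactly two results per call. */
void mul_two_lanes(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
{
    /* Unaligned load/store: no 16-byte alignment requirement on a, b, out. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i prod = _mm_mul_epu32(va, vb);
    _mm_storeu_si128((__m128i *)out, prod);
}

/* Multiplies n elements pairwise, two per iteration.
   Assumption: n is even; a trailing odd element would be skipped. */
void mul_array(const uint64_t *a, const uint64_t *b, uint64_t *out, size_t n)
{
    for (size_t i = 0; i + 1 < n; i += 2)
        mul_two_lanes(a + i, b + i, out + i);
}
```

Calling `mul_array` on four-element arrays filled with 2 and 3 should fill all four outputs with 6, because each call handles two elements. Two things may be worth checking in the original code: on typical 64-bit Linux systems `sizeof(long)` is 8 rather than 4, and glibc declares `memalign(alignment, size)` with the alignment as the first argument.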