0

I am getting Segmentation fault when running the compilation of following short C code:

#include <pmmintrin.h>
#include <stdio.h>
#include <stdlib.h>
#define VALUE 4242

typedef short int Type;
void threshold(Type *dst, const Type *src, int len)
{
  short int i, N=16;
  short int checkval[] = { VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE };

  const short int* __attribute__ ((aligned (16))) line1;

  __m128i v1, v2, v3:
  v2 = _mm_loadu_si128((__m128i*) checkval );

  for(i = 0; i < len; i+=N)
    {
      line1 = src + N*i;
      //Thisline                                                                                                                                     
      v1 = _mm_loadu_si128((__m128i*) line1 );
      v3 = _mm_cmpgt_epi16( v1, v2 );

      _mm_storeu_si128((__m128i*) (dst + N*i), v3 );

    }
  return;
}

int main()
{
  int N = 1024;
  Type dst[N], src[N];
  int i;
  for(i = 0; i < N; i++)
    src[i] = rand()%VALUE;

  threshold(dst, src, N);
  return 0;
}

I am quite assured the line after the "This line" is giving me problems, but I can't tell if the problem is caused by line1 not fitting inside __m128i or some other error.

As a subquestion - I tried to align the elements in line1, would function _mm_load_si128 be more appropriate (I tried both, but I cannot avoid the fault).

Cr_U
  • 37
  • 6

2 Answers2

1

In threshold,

short int i, N=16;

should be:

short int i, N=8;

This is because there are 8 x short int elements per vector, and pointer arithmetic takes the size of the elements into account (my guess is that you were assuming you needed to work with 16 bytes as the size of a vector ?).

Paul R
  • 208,748
  • 37
  • 389
  • 560
  • Yes that was my assumption. Strangely enough I obtained required behaviour only with (N = 1). It works, but I wonder if this really utilizes the performance benefit of vectorization. – Cr_U May 21 '15 at 11:44
  • If you set N=1 you will get correct results in this particular case but there will be many redundant calculations, because your vectors will overlap by 7 elements. You should be benchmarking your optimised code and you should therefore notice that performance is very poor with N=1. – Paul R May 21 '15 at 12:46
1

Before a comment on alignment, please take into account houssam's comment and Paul's answer, corruption comes from here for sure.

Now, src may be not 16-byte aligned. Refering to this post about alignment on static arrays, src is aligned on 2-byte words at least, but maybe not on 16-byte addresses.

So, even if line1 is declared as an aligned pointer, the assignment at first line of the for-loop may break this constraint.

Consider adding __attribute__ ((aligned(8))) in the declaration of src (inside main()).

EDIT

To be clear after my comment: one initial root cause was pointed out by PaulR in his answer, type __m128 has the size of 8 short ints, not 16. One other root cause of failure is a problem with index i. By writing line1 = src+i*N and i+=N, the offset from src is actually increasing by N*N at each iteration.

So you have two exclusive solutions:

  • either change the increment of index i and the end limit of the index : for(i = 0; i < len/N; i++); leave the rest as is.
  • or change the offsets, both for src and dst: line1 = src + i;, and _mm_storeu_si128((__m128i*) (dst + i), v3 );
Community
  • 1
  • 1
Bentoy13
  • 4,886
  • 1
  • 20
  • 33
  • OP is using `_mm_loadu_si128`/`_mm_storeu_si128`, so alignment shouldn't be a problem. – Paul R May 20 '15 at 18:04
  • I tried this approach, but I think I am still missing something. I could only solve this using unaligned versions of functions, which I guess is not optimal. – Cr_U May 21 '15 at 11:47
  • 1
    I think alignment may be eventually a problem, but here the problem is more simple. In a now-deleted comment (sadly), it was pointed out that your offset on `src` is `i*N`, but the index `i` is incremented by `N` and not 1. So you get a seg fault at some point because you try to reach address `src+1023*8`, far away from the actual end of the allocated space of `src`. – Bentoy13 May 21 '15 at 12:15
  • Thank you. This also explains why it only works with N=1. Now I got it to work with N=4, but for some reason not with N=8. – Cr_U May 21 '15 at 13:20
  • This is what I did to make it work, however I can only get 4 short ints inside the vector. Thank you! – Cr_U May 21 '15 at 17:06