
I seem to have fixed it myself by type casting the cij2 pointer inside the _mm256_storeu_pd call, like so:

_mm256_storeu_pd((double *)cij2, vecC);

I have no idea why this changed anything...

I'm writing some code and trying to take advantage of Intel's intrinsics for manual vectorization. But whenever I run the code I get a segmentation fault when trying to use my double *cij2.

if (q == 0)
{
    __m256d vecA;
    __m256d vecB;
    __m256d vecC;
    for (int i = 0; i < M; ++i)
      for (int j = 0; j < N; ++j)
      {
        double cij = C[i+j*lda];
        double *cij2 = (double *)malloc(4*sizeof(double));
        for (int k = 0; k < K; k+=4)
        {
          vecA = _mm256_load_pd(&A[i+k*lda]);
          vecB = _mm256_load_pd(&B[k+j*lda]);
          vecC = _mm256_mul_pd(vecA,vecB);
          _mm256_storeu_pd(cij2, vecC);
          for (int x = 0; x < 4; x++)
          {
            cij += cij2[x];
          }
        }
        C[i+j*lda] = cij;
      }
}

I've pinpointed the problem to the cij2 pointer. If I comment out the two lines that use that pointer, the code runs fine; it doesn't work like it should, but it will actually run.

My question is: why would I get a segmentation fault here? I know I've allocated the memory correctly, and that the buffer holds four 64-bit doubles, big enough for a 256-bit vector.

After reading the comments I've come to add some clarification. The first thing I did was change the _mm_malloc to just a normal allocation using malloc. It shouldn't matter either way, but theoretically it gives me some more breathing room.

Second, the problem isn't coming from a null return on the allocation. I added a couple of loops to increment through the array and make sure I could modify the memory without it crashing, so I'm relatively sure that isn't the problem. The problem seems to stem from storing the data from vecC into the array.

Lastly, I cannot use BLAS calls. This is for a parallelism class. I know it would be much simpler to call on something way smarter than I am, but unfortunately I'll get a 0 if I try that.

  • I'm uncertain about `_mm_malloc()`, but the standard `malloc()` and `calloc()` return NULL in the event that allocation fails. Supposing that `_mm_malloc()` does the same, you fail to check for that possibility. If it occurs, a segmentation fault when you subsequently try to use the pointer would not be surprising. – John Bollinger Sep 15 '16 at 17:13
  • On the other hand, I observe that [`_mm256_storeu_pd()`](https://software.intel.com/en-us/node/524110) seems to require a 256-bit-aligned pointer, but your allocation ensures only 64-bit alignment. Judging solely from the docs, you appear to want to allocate with `_mm_malloc(4*sizeof(double), 256)`. – John Bollinger Sep 15 '16 at 17:18
  • @JohnBollinger: storeu_pd is unaligned store. store_pd is aligned store. And _mm_malloc takes an alignment arg in bytes, not bits, so 64 is 64-byte aligned. _mm_malloc may be failing because `cij2` is never freed. Using automatic storage would be much more sensible. – Peter Cordes Sep 15 '16 at 17:38
  • @PeterCordes, ... and this is why I didn't make an answer out of my (apparently inadequate) reading of docs with which I was previously unfamiliar. – John Bollinger Sep 15 '16 at 17:40
  • This code will not perform well even once you get it working. Writing a scalar loop over vector elements inside the inner loop of a matrix multiply is absolutely horrible. A horizontal add inside the inner loop is bad enough, but doing it this way is just nasty. Especially using a dynamically-allocated buffer. You will have much better results using a well-tuned DGEMM function from a BLAS library. Optimal matrix multiplies for modern CPUs need to be cache-aware, and there are a lot of tricks (e.g. transposing one array in parts as needed). I don't recommend writing your own. – Peter Cordes Sep 15 '16 at 17:45
  • BTW, what is the value of `cij2` when your code segfaults (use a debugger)? Is it NULL? If so, then it's just from allocation failure from leaking memory. – Peter Cordes Sep 15 '16 at 17:47
  • I understand that this isn't the best way to do this but I am also required to write this with n^3 operations for a class. I wish I could just use the BLAS library. – bmelcher Sep 15 '16 at 17:59
  • I don't see anything obvious (not that I know anything about this, but based on curiosity, Google etc.), so I'm starting to wonder about the definitions of `A`, `B` and particularly `lda` that aren't shown, and any possible nasal demons. – Sep 15 '16 at 18:30
  • A and B are arrays of data. lda is an integer that is used as a scalar to make the arrays seem 2 dimensional. It's weird but I don't want to rewrite the testing program to pass real 2 dimensional arrays because I feel as if the teacher might not like that. This is for a matrix x matrix multiplication test to see how efficient we can get it without using BLAS/other better algorithms. As I said earlier, it has to be n^3 operations to get credit. – bmelcher Sep 15 '16 at 19:37

1 Answer


You dynamically allocate double *cij2 = (double *)malloc(4*sizeof(double)); but you never free it. This is just silly. Use double cij2[4], especially if you're not going to bother to align it. You never need more than one scratch buffer at once, and it's a small fixed size, so just use automatic storage.

In C++11, you'd use alignas(32) double cij2[4] so you could use _mm256_store_pd instead of storeu. (Or just to make sure storeu isn't slowed down by an unaligned address).
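
For example, here is a minimal sketch of what that looks like, assuming the same A/B layout, lda, and K as in the question, and a hypothetical helper name (C11 alignas via <stdalign.h>; in C++11 plain alignas works):

    #include <immintrin.h>
    #include <stdalign.h>   // C11 alignas; not needed in C++11

    // Hypothetical helper: the dot product feeding one C[i+j*lda], same indexing as the question.
    static double dot_ij(const double *A, const double *B, int lda, int K, int i, int j)
    {
        alignas(32) double cij2[4];   // automatic storage: no malloc/free, 32-byte aligned
        double cij = 0.0;
        for (int k = 0; k < K; k += 4)
        {
            __m256d vecA = _mm256_load_pd(&A[i+k*lda]);  // aligned loads as in the question;
            __m256d vecB = _mm256_load_pd(&B[k+j*lda]);  // use _mm256_loadu_pd if A/B may be unaligned
            __m256d vecC = _mm256_mul_pd(vecA, vecB);
            _mm256_store_pd(cij2, vecC);                 // aligned store is now safe
            for (int x = 0; x < 4; x++)
                cij += cij2[x];
        }
        return cij;
    }

The body of the j loop then becomes C[i+j*lda] += dot_ij(A, B, lda, K, i, j);.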


If you actually want to debug your original, use a debugger to catch it when it segfaults, and look at the pointer value. Make sure it's something sensible.

Your methods for testing that the memory was valid (like looping over it, or commenting stuff out) sound like they could lead to a lot of your loop being optimized away, so the problem wouldn't happen.

When your program crashes, you can also look at the asm instructions. Vector intrinsics map fairly directly to x86 asm (except when the compiler sees a more efficient way).


Your implementation would suck a lot less if you pulled the horizontal sum out of the loop over k. Instead of storing each multiply result and horizontally adding it, use a vector add into a vector accumulator. hsum it outside the loop over k.

    __m256d cij_vec = _mm256_setzero_pd();
    for (int k = 0; k < K; k+=4) {
      vecA = _mm256_load_pd(&A[i+k*lda]);
      vecB = _mm256_load_pd(&B[k+j*lda]);
      vecC = _mm256_mul_pd(vecA,vecB);
      cij_vec = _mm256_add_pd(cij_vec, vecC);  // TODO: use multiple accumulators to keep multiple VADDPD or VFMA...PD instructions in flight.
    }
    C[i+j*lda] = hsum256_pd(cij_vec);  // put the horizontal sum in an inline function

For good hsum256_pd implementations (other than storing to memory and using a scalar loop), see Fastest way to do horizontal float vector sum on x86 (I included an AVX version there; it should be easy to adapt the pattern of shuffling to 256b double-precision). This will help your code a lot, since with this change you only have O(N^2) horizontal sums instead of O(N^3).
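
For reference, one possible hsum256_pd along those lines (a sketch of the usual extract-high-half-and-add pattern; the linked answer may use a slightly different shuffle sequence):

    static inline double hsum256_pd(__m256d v)
    {
        __m128d lo   = _mm256_castpd256_pd128(v);     // low half:  {v0, v1}
        __m128d hi   = _mm256_extractf128_pd(v, 1);   // high half: {v2, v3}
        __m128d sum  = _mm_add_pd(lo, hi);            // {v0+v2, v1+v3}
        __m128d shuf = _mm_unpackhi_pd(sum, sum);     // {v1+v3, v1+v3}
        return _mm_cvtsd_f64(_mm_add_sd(sum, shuf));  // (v0+v2) + (v1+v3)
    }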

Ideally you could accumulate results for 4 i values in parallel, and not need horizontal sums.

VADDPD has a latency of 3 to 4 clocks, and a throughput of one per 1 to 0.5 clocks, so you need 3 to 8 vector accumulators to saturate the execution units. Or with FMA, up to 10 vector accumulators (e.g. on Haswell, where FMA...PD has 5c latency and one per 0.5c throughput). See Agner Fog's instruction tables and optimization guides to learn more about that; see also the x86 tag wiki.
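
As a hedged sketch of what multiple accumulators could look like for the loop over k (assuming K is a multiple of 16; a real version needs a cleanup loop, and with FMA you would use _mm256_fmadd_pd instead of the separate mul/add):

    __m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd(), acc3 = _mm256_setzero_pd();
    for (int k = 0; k < K; k += 16) {   // 4 accumulators, 4 vectors of doubles per iteration
      acc0 = _mm256_add_pd(acc0, _mm256_mul_pd(_mm256_load_pd(&A[i+(k+ 0)*lda]),
                                               _mm256_load_pd(&B[(k+ 0)+j*lda])));
      acc1 = _mm256_add_pd(acc1, _mm256_mul_pd(_mm256_load_pd(&A[i+(k+ 4)*lda]),
                                               _mm256_load_pd(&B[(k+ 4)+j*lda])));
      acc2 = _mm256_add_pd(acc2, _mm256_mul_pd(_mm256_load_pd(&A[i+(k+ 8)*lda]),
                                               _mm256_load_pd(&B[(k+ 8)+j*lda])));
      acc3 = _mm256_add_pd(acc3, _mm256_mul_pd(_mm256_load_pd(&A[i+(k+12)*lda]),
                                               _mm256_load_pd(&B[(k+12)+j*lda])));
    }
    // Combine the accumulators once, outside the loop, then do one horizontal sum.
    C[i+j*lda] = hsum256_pd(_mm256_add_pd(_mm256_add_pd(acc0, acc1),
                                          _mm256_add_pd(acc2, acc3)));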

Also, ideally nest your loops in a way that gives you contiguous access to two of your three arrays, since cache access patterns are critical for matmul (lots of data reuse), even if you don't get fancy and transpose small blocks at a time that fit in cache. Even transposing one of your input matrices can be a win, since that costs O(N^2) and speeds up the O(N^3) process. I see your inner loop currently has a stride of lda while accessing A[].
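
For instance, a hedged sketch of transposing A up front so the loop over k reads four consecutive k values of row i from contiguous memory (At is a hypothetical scratch buffer; this assumes A is M x K with the i + k*lda indexing used above):

    double *At = (double *)malloc((size_t)M * K * sizeof(double));   // O(N^2) space and time
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k)
            At[k + i*K] = A[i + k*lda];   // row i of A becomes contiguous in k
    // ... then inside the k loop: vecA = _mm256_loadu_pd(&At[k + i*K]);
    free(At);   // after the whole multiply is done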

– Peter Cordes