
I am modifying RNNLM, a neural network for language modeling. However, given the size of my corpus, it runs very slowly. I tried to optimize the matrix*vector routine, which accounts for 63% of the total time on a small data set (I would expect it to be worse on larger sets). Right now I am stuck on the intrinsics.

    for (b=0; b<(to-from)/8; b++)   // one iteration per 8 output elements
    {
        val = _mm256_setzero_ps();
        for (a=from2; a<to2; a++)
        {
            t1 = _mm256_set1_ps(srcvec.ac[a]);  // broadcast one source element to all 8 lanes
            t2 = _mm256_load_ps(&(srcmatrix[a+(b*8+from+0)*matrix_width].weight));  // 8 consecutive weights
            //val = _mm256_fmadd_ps(t1, t2, val);  // FMA version (needs FMA hardware)
            t3 = _mm256_mul_ps(t1,t2);
            val = _mm256_add_ps(val, t3);       // accumulate partial sums
        }
        t4 = _mm256_load_ps(&(dest.ac[b*8+from+0]));
        t4 = _mm256_add_ps(t4, val);
        _mm256_store_ps(&(dest.ac[b*8+from+0]), t4);  // <-- crashes here
    }

This example crashes on:

    _mm256_store_ps (&(dest.ac[b*8+from+0]), t4);

However, if I change it to

    _mm256_storeu_ps (&(dest.ac[b*8+from+0]), t4);

(with u for unaligned, I suppose), everything works as intended. My question is: why does the load work (when it shouldn't, if the data is unaligned) while the store doesn't? Furthermore, both operate on the same address.

dest.ac has been allocated using

void *_aligned_calloc(size_t nelem, size_t elsize, size_t alignment=64)
{
    size_t max_size = (size_t)-1;

    // Watch out for overflow
    if(elsize == 0 || nelem >= max_size/elsize)
        return NULL;

    size_t size = nelem * elsize;
    void *memory = _mm_malloc(size+64, alignment);
    if(memory != NULL)
        memset(memory, 0, size);
    return memory;
}

and it is at least 50 elements long. (By the way, with VS2012 I get an illegal instruction on some random assignment, so I use Linux.)
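
For example, the array is allocated along these lines (names approximate):

    /* dest.ac gets a 64-byte-aligned base pointer, but &dest.ac[i]
       is only 32-byte aligned when i is a multiple of 8. */
    dest.ac = (float *)_aligned_calloc(layer_size, sizeof(float), 64);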

Thank you in advance, Arkantus.

  • What is the value of `from`? Is there a chance that the `_mm256_load_ps` intrinsic is actually implemented as 2 128-bit loads? – Come Raczy May 19 '15 at 16:23
  • The value of `from` when it crashes is 891. `&(dest.ac[b*8+from+0])` = 0x957e6c. So there is an access in the middle of the table, and it is not aligned. – Arkantus May 20 '15 at 08:03
  • With that value it is even more surprising that the load works. Did you check that you are actually loading the correct values (for that value of `from`)? – Come Raczy May 20 '15 at 15:36
  • You should check the generated ASM, and see if it's re-computing the array index every time through the inner loop. If so, pull the constant part out of the loop. What usually works well is to have the outer loop increment `b` by `8 * matrix_width`, instead of multiplying `b * 8` in the index expression. gcc seems bad at transforming loops to only maintain a scaled version of the loop counter when you don't write the loop that way. (A sketch of this restructuring follows these comments.) – Peter Cordes Jun 24 '15 at 12:34
  • Also, the `set1` intrinsics can be slow. Be careful with them. Hopefully that's compiling to a `vbroadcastss ymm, [mem]`. If you can arrange your data structures to not need that in the inner loop, that might be faster. Just exchanging the inner/outer loops, so the same `srcvec` element is used for all the `b` values, would be slower because of having to gather the non-contiguous data from `srcmatrix`. `vbroadcastss` is 2 uops, and 5 cycle latency from memory (on Haswell); 1 cycle less with a 128-bit destination instead of 256. Throughput is 1 per cycle (it can only run on port 5 on SnB/IvB/HSW). – Peter Cordes Jun 24 '15 at 12:47
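
A sketch of the indexing restructure suggested in the comments above, using the same variables as the question's loop (the `set1`/broadcast issue is left as-is, and the unaligned load/store variants are used since the addresses may not be 32-byte aligned):

    // Sketch only: keep a running row offset instead of recomputing
    // (b*8+from)*matrix_width on every inner iteration.
    size_t base = (size_t)from * matrix_width;      // offset for b == 0
    for (b = 0; b < (to - from) / 8; b++, base += 8 * matrix_width)
    {
        val = _mm256_setzero_ps();
        for (a = from2; a < to2; a++)
        {
            t1 = _mm256_set1_ps(srcvec.ac[a]);                   // broadcast
            t2 = _mm256_loadu_ps(&(srcmatrix[a + base].weight)); // 8 weights
            val = _mm256_add_ps(val, _mm256_mul_ps(t1, t2));
        }
        t4 = _mm256_loadu_ps(&(dest.ac[b*8 + from]));
        _mm256_storeu_ps(&(dest.ac[b*8 + from]), _mm256_add_ps(t4, val));
    }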

1 Answer


TL;DR: In optimized code, loads fold into memory operands for other operations, which have no alignment requirement in AVX. Stores won't fold.


Your sample code doesn't compile by itself, so I can't easily check what instruction _mm256_load_ps compiles to.

I tried a small experiment with gcc 4.9, and it doesn't generate a vmovaps at all for _mm256_load_ps, since I only used the result of the load as an input to one other instruction. It generates that instruction with a memory operand. AVX instructions have no alignment requirements for their memory operands. (There is a performance hit for crossing a cache line, and a bigger hit for crossing a page boundary, but your code still works.)
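
For illustration, a minimal version of that experiment (compiled with something like `gcc -O2 -mavx -S`; exact registers vary):

    #include <immintrin.h>

    // The "aligned" load below never becomes a standalone vmovaps here:
    // gcc typically folds it into the add as a memory operand, e.g.
    //     vaddps ymm0, ymm0, [rdi]
    // and AVX memory operands have no alignment requirement.
    __m256 acc_add(__m256 acc, const float *p)
    {
        return _mm256_add_ps(acc, _mm256_load_ps(p));
    }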

The store, on the other hand, does generate a vmov... instruction. Since you used the alignment-required version, it faults on unaligned addresses. Simply use the unaligned version; it'll be just as fast when the address is aligned, and still work when it isn't.
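
Concretely, the read-modify-write at the end of your loop becomes:

    t4 = _mm256_loadu_ps(&(dest.ac[b*8+from+0]));   // explicit unaligned load
    t4 = _mm256_add_ps(t4, val);
    _mm256_storeu_ps(&(dest.ac[b*8+from+0]), t4);   // vmovups: works aligned or not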

I didn't check your code carefully to see whether all the accesses SHOULD be aligned; I assume some aren't, from the way you phrased it to just ask why you weren't also getting faults for unaligned loads. Like I said, your code probably just didn't compile to any vmovaps load instructions: the loads were folded into memory operands, which never fault on unaligned addresses.

Are you running AVX (without AVX2 or FMA?) on a Sandy/Ivybridge CPU? I assume that's why your FMA intrinsics are commented out.
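
If you ever build for an FMA-capable CPU (Haswell and later), the commented-out intrinsic could be enabled conditionally, e.g. (`__FMA__` is defined by gcc when compiling with `-mfma`):

    #ifdef __FMA__
        val = _mm256_fmadd_ps(t1, t2, val);              // fused multiply-add
    #else
        val = _mm256_add_ps(val, _mm256_mul_ps(t1, t2)); // plain AVX fallback
    #endif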

Peter Cordes
  • Yes, I am using AVX on a Sandy Bridge CPU, and yes, some accesses are not aligned! Thanks! I'll use the u version now that I understand why! – Arkantus Jun 24 '15 at 09:32