
I'm trying to use Intel intrinsics to perform an operation quickly on a float array. The operations themselves seem to work fine; however, when I try to get the result of the operation into a standard C variable, I get a SEGFAULT. If I comment out the indicated line below, the program runs. If I save the result of the indicated line but do not manipulate it in any way, the program also runs fine. It is only when I interact in any way with the result of `_mm_cvtss_f32(C)` that my program crashes. Any ideas?

#include <immintrin.h> // _mm_hadd_ps requires SSE3; immintrin.h covers all the Intel intrinsics used here

float proc(float *a, float *b, int n, int c, int width) {
    // Operation: SUM: (A - B) ^ 2
    __m128 A, B, C;
    float total = 0;
    for (int d = 0, k = 0; k < c; d += width, k++) {
        for (int i = 0; i < n / 4 * 4; i += 4) {
            A = _mm_load_ps(&a[i + d]);
            B = _mm_load_ps(&b[i + d]);
            C = _mm_sub_ps(A, B);
            C = _mm_mul_ps(C, C);
            C = _mm_hadd_ps(C, C);
            C = _mm_hadd_ps(C, C);
            total += _mm_cvtss_f32(C); // SEGFAULT HERE
        }
        for (int i = n / 4 * 4; i < n; i++) {
            float diff = a[i + d] - b[i + d]; // float, not int: truncation would lose the fractional difference
            total += diff * diff;
        }
    }
    return total;
}
Simon

  • Are you sure your program actually crashes at the instruction you cited, or is the compiler just optimizing the rest of the loop away if you remove the `_mm_cvtss_f32()` line (it doesn't have any other visible side effects)? A potential failure cause would be improper alignment of the `a` and `b` arrays, since you are using aligned load instructions. Are you sure they are 16-byte aligned? On contemporary Intel hardware, there is very little performance difference between 16-byte aligned and unaligned loads (`movaps` has a shorter instruction encoding than `movups`, but that's about it). – Jason R Nov 16 '16 at 19:50
  • Thank you, I changed the `load` to a `loadu` and it seems to work now! (A minimal sketch of this fix follows these comments.) – Simon Nov 16 '16 at 20:05
  • @JasonR: Their encoding is the same length: http://www.felixcloutier.com/x86/MOVAPS.html vs. http://www.felixcloutier.com/x86/MOVUPS.html. If you were comparing disassembly, did one of them have a REX prefix, or a different addressing mode? Anyway, they perform identically when the data is aligned at run-time, but when L1 cache read bandwidth is a bottleneck, aligned loads have an advantage. It's a good idea to make sure your data is aligned when it's cheap. – Peter Cordes Nov 16 '16 at 20:12
  • Also, @Simon: take the HADD instructions out of the loop, and do the horizontal sum at the end. Use `_mm_add_ps` to add the vector of results to a vector accumulator (see the accumulator sketch after these comments). Preferably unroll with multiple vector accumulators to hide the latency of the add to the accumulator, so the loop doesn't bottleneck on the latency of the loop-carried dependency chain. Auto-vectorization might work well here, esp. if you tell the compiler the pointers are aligned. – Peter Cordes Nov 16 '16 at 20:14
  • @PeterCordes: I definitely defer to your expansive x86 expertise. I didn't realize that aligned loads had any remaining advantages nowadays. Do you have a reference for the L1-cache-read-bandwidth issue that you pointed out above? – Jason R Nov 16 '16 at 20:23
  • @JasonR: I forget where exactly I read this, but a load that crosses a cache-line boundary needs to access both lines. The hardware takes care of this with only a small latency penalty, and also a throughput penalty, since it takes two cache read-port operations. Page splits are even worse: hundreds of cycles of latency until Skylake, which brought it down to about 5 cycles. Unaligned loads are no more expensive when no cache-line boundary is crossed, though. So if your data is almost always aligned at run-time, it makes sense to use MOVUPS: no wasted time checking for alignment, and fast enough in the rare unaligned case. – Peter Cordes Nov 16 '16 at 20:29
  • @JasonR: Also, IIRC Intel's optimization manual mentions that unaligned 32B YMM loads or stores are subject to false dependencies, or something like that. I'd have to look it up, since I forget the details, but apparently memory disambiguation doesn't work as well for 32B unaligned as for 32B aligned (or 16B unaligned). I can't remember if this applies only to SnB/IvB (where 32B memory ops are split in two), or also to HSW/SKL. – Peter Cordes Nov 16 '16 at 20:32
  • @JasonR: Oh, and also: `_mm_load_ps` can be folded into a memory operand for SUBPS, but `loadu` can only be used as a memory operand with AVX. Without AVX, the compiler must emit a separate MOVUPS instruction (unless it can prove that the addresses will in fact always be aligned at run-time). Separate load instructions increase code size and fused-domain uop count, whereas a load folded into the ALU instruction can micro-fuse ([except for indexed addressing modes on some SnB-family CPUs](http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes/31027695#31027695)). – Peter Cordes Nov 16 '16 at 20:35
  • @JasonR: So there's no advantage to ever actually using a MOVAPS instruction in asm vs. a MOVUPS, except to detect misaligned data while tuning code that you think should only be operating on aligned data. But there are advantages to actually aligning your data, and to telling the compiler about it, especially when auto-vectorizing instead of doing it manually with intrinsics (see the alignment-hint sketch after these comments). gcc likes to fully unroll intro loops that go scalar until an aligned pointer is reached, so promising the compiler that the input is already aligned avoids a lot of code bloat. – Peter Cordes Nov 16 '16 at 20:40
  • Thanks for the detailed breakdown! – Jason R Nov 16 '16 at 20:48
  • @JasonR and @PeterCordes, I can't begin to thank you both enough! I unrolled my loop and performed all my additions at the end, and that effectively increased my speedup by 25%! Although I'm not proficient enough in SSE/AVX intrinsics to fully understand your discussion, I learned a lot from this. Thanks again! – Simon Nov 16 '16 at 23:10
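
For reference, a minimal sketch of the fix Simon describes above, assuming nothing about the alignment of `a` and `b`: the inner loop is unchanged except that the aligned `_mm_load_ps` becomes the unaligned `_mm_loadu_ps`, which is valid for any pointer.

for (int i = 0; i < n / 4 * 4; i += 4) {
    A = _mm_loadu_ps(&a[i + d]); // unaligned load: no 16-byte requirement
    B = _mm_loadu_ps(&b[i + d]);
    C = _mm_sub_ps(A, B);
    C = _mm_mul_ps(C, C);
    C = _mm_hadd_ps(C, C);       // two HADDs sum the 4 lanes horizontally
    C = _mm_hadd_ps(C, C);
    total += _mm_cvtss_f32(C);   // no longer faults: the loads never required alignment
}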
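A sketch of the restructuring Peter Cordes suggests, with hedges: the function name `proc_unrolled`, the unrolling factor, and the use of unaligned loads are illustrative choices, not anything prescribed in the thread. Squared differences accumulate in two vector accumulators, and the two HADDs run once per row instead of once per iteration.

#include <immintrin.h>

// Hypothetical restructuring: accumulate (a[i]-b[i])^2 in vector registers
// and defer the horizontal sum to the end of each row.
float proc_unrolled(const float *a, const float *b, int n, int c, int width) {
    float total = 0;
    for (int d = 0, k = 0; k < c; d += width, k++) {
        __m128 acc0 = _mm_setzero_ps();
        __m128 acc1 = _mm_setzero_ps(); // second accumulator hides some ADDPS latency
        int i = 0;
        for (; i + 8 <= n; i += 8) {    // two vectors per iteration
            __m128 d0 = _mm_sub_ps(_mm_loadu_ps(&a[i + d]),     _mm_loadu_ps(&b[i + d]));
            __m128 d1 = _mm_sub_ps(_mm_loadu_ps(&a[i + d + 4]), _mm_loadu_ps(&b[i + d + 4]));
            acc0 = _mm_add_ps(acc0, _mm_mul_ps(d0, d0));
            acc1 = _mm_add_ps(acc1, _mm_mul_ps(d1, d1));
        }
        for (; i + 4 <= n; i += 4) {    // at most one leftover whole vector
            __m128 d0 = _mm_sub_ps(_mm_loadu_ps(&a[i + d]), _mm_loadu_ps(&b[i + d]));
            acc0 = _mm_add_ps(acc0, _mm_mul_ps(d0, d0));
        }
        __m128 acc = _mm_add_ps(acc0, acc1);
        acc = _mm_hadd_ps(acc, acc);    // horizontal sum: once per row,
        acc = _mm_hadd_ps(acc, acc);    // not once per loop iteration
        total += _mm_cvtss_f32(acc);
        for (; i < n; i++) {            // scalar tail
            float diff = a[i + d] - b[i + d];
            total += diff * diff;
        }
    }
    return total;
}

Two accumulators halve the length of the loop-carried dependency chain through the adds; more may help on CPUs with higher FP-add latency.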
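On "telling the compiler the pointers are aligned": a small sketch assuming GCC or Clang, since `__builtin_assume_aligned` is a compiler-specific builtin; the plain scalar loop here is just an illustration for the auto-vectorizer, and `sumsq_aligned` is a hypothetical name.

// Sketch (GCC/Clang only): promise the compiler that a and b are 16-byte
// aligned, so its auto-vectorizer can skip the scalar intro loop it would
// otherwise emit to reach an aligned pointer.
float sumsq_aligned(const float *a, const float *b, int n) {
    const float *pa = (const float *)__builtin_assume_aligned(a, 16);
    const float *pb = (const float *)__builtin_assume_aligned(b, 16);
    float total = 0;
    for (int i = 0; i < n; i++) {
        float diff = pa[i] - pb[i];
        total += diff * diff;
    }
    return total;
}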

1 Answer


Are you sure your program actually crashes at the instruction you cited, or is the compiler just optimizing the rest of the loop away if you remove the `_mm_cvtss_f32()` line (it doesn't have any other visible side effects)? A potential failure cause would be improper alignment of the `a` and `b` arrays, since you are using aligned load instructions. Are you sure they are 16-byte aligned? On contemporary Intel hardware, there is very little performance difference between 16-byte aligned and unaligned loads (see the comments on the question above for a discussion of the issue).
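
If the aligned loads are to be kept, `a` and `b` must actually start on a 16-byte boundary. A minimal sketch using C11's `aligned_alloc` (the helper name is mine; `_mm_malloc`/`_mm_free` are a common alternative where C11 isn't available):

#include <stdlib.h>

// C11 aligned_alloc requires the size to be a multiple of the alignment,
// so round the byte count up to a multiple of 16. Release with free().
float *alloc_floats16(size_t n) {
    size_t bytes = (n * sizeof(float) + 15) & ~(size_t)15;
    return aligned_alloc(16, bytes);
}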

I mentioned in my original comment that `movaps` has a shorter encoding than `movups`. This is not correct. I was thinking instead of `movaps` versus `movapd`, which perform the same memory transfer, only they're labeled as being for single-precision and double-precision data, respectively. In practice they do the same thing, but `movaps` has a shorter encoding.

Jason R