I have the following loop in my inline assembly code:

"LOOP%=:\n\t"
       "movapd  (%%eax), %%xmm4\n\t"
       "addl    $32, %%eax\n\t"
       "movsd   (%%edx), %%xmm5\n\t"
       "addl    $16, %%edx\n\t"
       "movapd  %%xmm4, %%xmm6\n\t"
       "subl    $1, %%ecx\n\t"
       "unpcklpd %%xmm5, %%xmm5\n\t"
       "testl   %%ecx, %%ecx\n\t"
       "mulpd   %%xmm5, %%xmm6\n\t"
       "movsd   -8(%%edx), %%xmm7\n\t"
       "addpd   %%xmm6, %%xmm0\n\t"
       "movapd  -16(%%eax), %%xmm6\n\t"
       "unpcklpd %%xmm7, %%xmm7\n\t"
       "mulpd   %%xmm6, %%xmm5\n\t"
       "addpd   %%xmm5, %%xmm1\n\t"
       "mulpd   %%xmm7, %%xmm4\n\t"
       "addpd   %%xmm4, %%xmm2\n\t"
       "mulpd   %%xmm6, %%xmm7\n\t"
       "addpd   %%xmm7, %%xmm3\n\t"
       "jne LOOP%=\n\t"

This code keeps a loop counter in %ecx while scanning two (double *) arrays, A and B, and performing some computation with SSE2. Both arrays are aligned to 64 bytes (the cache-line size, so the 16-byte alignment requirement of SSE is satisfied). %eax holds a pointer into array A and %edx holds a pointer into array B. The code runs correctly, with no invalid memory reads. I am wondering why it has to be written this way, incrementing the pointers early and then using negative offsets for the later loads:

       "movapd  (%%eax), %%xmm4\n\t"
       "addl    $32, %%eax\n\t"
       "movsd  (%%edx), %%xmm5\n\t"
       "addl    $16, %%edx\n\t"
       ......
       "movsd   -8(%%edx), %%xmm7\n\t"
       ......
       "movapd  -16(%%eax), %%xmm6\n\t"
       ......

So I changed the initial version to:

   "LOOP%=:\n\t"
       "movapd  (%%eax), %%xmm4\n\t"
       "movsd   (%%edx), %%xmm5\n\t"
       "movapd  %%xmm4, %%xmm6\n\t"
       "subl    $1, %%ecx\n\t"
       "unpcklpd %%xmm5, %%xmm5\n\t"
       "testl   %%ecx, %%ecx\n\t"
       "mulpd   %%xmm5, %%xmm6\n\t"
       "movsd   8(%%edx), %%xmm7\n\t"
       "addl    $16, %%edx\n\t"
       "addpd   %%xmm6, %%xmm0\n\t"
       "movapd  16(%%eax), %%xmm6\n\t"
       "addl    $32, %%eax\n\t"
       "unpcklpd %%xmm7, %%xmm7\n\t"
       "mulpd   %%xmm6, %%xmm5\n\t"
       "addpd   %%xmm5, %%xmm1\n\t"
       "mulpd   %%xmm7, %%xmm4\n\t"
       "addpd   %%xmm4, %%xmm2\n\t"
       "mulpd   %%xmm6, %%xmm7\n\t"
       "addpd   %%xmm7, %%xmm3\n\t"
       "jne LOOP%=\n\t"

But now I get a segmentation fault caused by an invalid read.

This seems strange to me. Why does it happen?

Zheyuan Li
  • 3
    It would be far faster for you to run this under a debugger and see which instruction causes the fault and then look at the registers to see what it was doing instead of asking us. – wallyk Feb 04 '16 at 17:38
  • Please do not post code underneath the title "segfault" and then say it runs correctly. – Weather Vane Feb 04 '16 at 17:39
  • I don't see why you changed code that was working correctly in the first place. – Weather Vane Feb 04 '16 at 17:57
  • So the code you posted is the compiler generated code? And when you altered it, it failed? – Weather Vane Feb 04 '16 at 18:09
  • You are more apologetic than feared by the community. Just post a good question. You started by saying "my asm assembly code" and now you say it isn't? – Weather Vane Feb 04 '16 at 18:20
  • 1
    "I would like to know whether the compiler is arranging computation and allocating registers as I would expect." That is unlikely, I would expect the compiler writers to be ahead of you. – Weather Vane Feb 04 '16 at 18:23

1 Answer

This is the cause:

   "testl   %%ecx, %%ecx\n\t"

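The result of this test is used by the conditional jump at the very end of the loop. By moving the add instructions after the test, you overwrite the flags that the jump depends on. The result of a pointer addition is never zero here, so the jump condition is always satisfied, and the loop keeps running until it reads past the end of mapped memory.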
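To make the flag dependency concrete, here is a minimal sketch in C that models only the zero flag (ZF) and the three integer registers involved. The `Cpu` struct, the helper names, the starting addresses and the `limit` cap are all hypothetical and exist only for this illustration; the SSE instructions are omitted because they do not modify the flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model (assumption for illustration): only ZF and three registers. */
typedef struct {
    uint32_t eax, edx, ecx;
    bool zf;
} Cpu;

/* addl/subl set ZF from their result; testl r,r sets ZF from r & r. */
static void add_imm(Cpu *c, uint32_t *reg, uint32_t imm) { *reg += imm; c->zf = (*reg == 0); }
static void sub_imm(Cpu *c, uint32_t *reg, uint32_t imm) { *reg -= imm; c->zf = (*reg == 0); }
static void test_reg(Cpu *c, uint32_t reg) { c->zf = (reg == 0); }

/* Original order: both addl run before testl, so jne sees ecx's flag.
 * Returns the number of iterations executed, capped at limit. */
static unsigned run_original(uint32_t n, unsigned limit)
{
    assert(n > 0);
    Cpu c = { .eax = 0x1000, .edx = 0x2000, .ecx = n, .zf = false };
    unsigned iters = 0;
    do {
        add_imm(&c, &c.eax, 32);
        add_imm(&c, &c.edx, 16);
        sub_imm(&c, &c.ecx, 1);
        test_reg(&c, c.ecx);            /* last flag writer before jne */
        iters++;
    } while (!c.zf && iters < limit);   /* jne: loop while ZF == 0 */
    return iters;
}

/* Modified order: the addl instructions run after testl, so jne sees the
 * flag of the pointer addition, which is not zero for these addresses. */
static unsigned run_modified(uint32_t n, unsigned limit)
{
    assert(n > 0);
    Cpu c = { .eax = 0x1000, .edx = 0x2000, .ecx = n, .zf = false };
    unsigned iters = 0;
    do {
        sub_imm(&c, &c.ecx, 1);
        test_reg(&c, c.ecx);
        add_imm(&c, &c.edx, 16);
        add_imm(&c, &c.eax, 32);        /* last flag writer before jne */
        iters++;
    } while (!c.zf && iters < limit);
    return iters;
}
```

In this model, `run_original(4, 1000)` stops after 4 iterations, while `run_modified(4, 1000)` only stops because of the artificial cap. The real loop has no such cap, so it walks off the end of the arrays.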

Zbynek Vyskovsky - kvr000
  • 3
    @AlphaBetaGamma, look up each instruction you use, to see what effect it has on the flags. Some operations set various flags as a by-product, some don't set any (existing flags survive), some have the distinct purpose to set flags. – Weather Vane Feb 04 '16 at 18:05
  • 1
    @AlphaBetaGamma: also, it will perform better to put the flag-setting instruction right next to the conditional branch, so they can macro-fuse into one compare-and-branch operation internally in the CPU. e.g. `dec %ecx` / `jnz` will be good. See http://agner.org/optimize/ – Peter Cordes Feb 04 '16 at 18:17
  • @PeterCordes: That is contrary to optimization techniques used in the 1990s. There one would separate the flag setting from the test by an instruction or two so that separate pipelines could progress and none would be stalled waiting for a computation. – wallyk Feb 04 '16 at 21:58
  • @wallyk: Yep, changes in CPU design have led to changes in software optimization. software-pipelining of short dependency chains isn't helpful. OOO execution will typically "see" many iterations forward for the loop counter (and can get started working on loads for future iterations as soon as their address is ready, which often doesn't depend on the main loop-carried dependency chain). compare-and-branch is such a common pattern that boosting insn throughput by fusing them is worth it, but this can only happen when they're adjacent so they don't have to be tracked separately to retirement. – Peter Cordes Feb 05 '16 at 01:45
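The suggestion in the comments, keeping the flag-setting instruction directly in front of the conditional branch, might look roughly like the sketch below. It is a stripped-down loop skeleton rather than the asker's kernel: the function name `count_iterations` is hypothetical, the loop body is reduced to a single increment, and a plain C fallback is included for non-x86 targets. Since `subl` already sets ZF from its result, a separate `testl` is not needed.

```c
#include <assert.h>

/* Runs a counted loop n times and returns how many iterations executed.
 * Hypothetical helper for illustration only. */
static unsigned count_iterations(unsigned n)
{
    unsigned iters = 0;
    assert(n > 0);  /* with n == 0 the counter would wrap around */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile(
        "1:\n\t"
        "addl $1, %[iters]\n\t"   /* stand-in for the loop body and pointer updates */
        "subl $1, %[n]\n\t"       /* sets ZF when the counter reaches zero */
        "jnz  1b\n\t"             /* branch directly follows the flag writer */
        : [iters] "+r"(iters), [n] "+r"(n)
        :
        : "cc");
#else
    /* Portable fallback with the same behavior. */
    do { iters++; } while (--n != 0);
#endif
    return iters;
}
```

With this layout, any pointer increments can be placed anywhere above the `subl` without affecting the branch, so the later loads no longer need negative offsets to work around the flags.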