
This is related to "Probable instruction Cache Synchronization issue in self modifying code?", a question I asked some time back. The accepted solution fixed that issue, but I have since come across a new intermittent failure mode in which the CPU tries to jump to a junk address after the function is switched back on. However, the disassembly taken after the fact (from a core dump) shows the correct address in the call instruction.

Some gdb analysis follows.

Program terminated with signal 11, Segmentation fault.
#0  0x00000000014010d0 in ?? ()
(gdb) bt
#0  0x00000000014010d0 in ?? ()
#1  0x0000000000492e01 in FastPelY_14 ()
#2  0x000000000045d85d in SubPelBlockMotionSearch ()
#3  0x0000000000467a23 in BlockMotionSearch ()
#4  0x0000000000469c99 in PartitionMotionSearch ()
#5  0x0000000000487bf7 in encode_one_macroblock ()
#6  0x0000000000496ccd in encode_one_slice ()
#7  0x0000000000426081 in code_a_picture ()
#8  0x000000000042766f in frame_picture ()
#9  0x000000000042664b in encode_one_frame ()
#10 0x0000000000430a23 in main ()

(gdb) disas /r 0x0000000000492e01
   0x0000000000492dfc <+38>:    e8 cf e2 f6 ff  callq  0x4010d0 
=> 0x0000000000492e01 <+43>:    8b 45 e4    mov    -0x1c(%rbp),%eax

The interesting thing to note here is that while the correct address is 0x4010d0, the junk address is always 0x14010d0 when it fails. That makes me think it is the call instruction that failed somehow, even though the backtrace shows the instruction pointer pointing at the next instruction. (Maybe that's the proper behavior with gdb; I am not quite sure.)

If that's the case, the CPU has apparently tried to execute e8 cf e2 f6 00 instead of e8 cf e2 f6 ff. The 5-byte sequence that initially lived at the call site starting at 0x0000000000492dfc is the 5-byte NOP 0x0F1F440000 (following the suggestions given in the question linked at the top).
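The arithmetic behind this hypothesis can be checked with a small sketch. It decodes a 5-byte near call (opcode e8 followed by a little-endian 32-bit displacement relative to the next instruction). `call_target` is a hypothetical helper written for illustration; the addresses and byte values are the ones from the disassembly and the NOP quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Decode the target of a 5-byte near call located at call_addr.
 * Encoding: opcode 0xE8, then a little-endian signed 32-bit displacement
 * that is added to the address of the *next* instruction (call_addr + 5). */
uint64_t call_target(uint64_t call_addr, const uint8_t insn[5])
{
    assert(insn[0] == 0xE8); /* only the rel32 near call is handled */

    uint32_t raw = (uint32_t)insn[1]
                 | (uint32_t)insn[2] << 8
                 | (uint32_t)insn[3] << 16
                 | (uint32_t)insn[4] << 24;

    /* Sign-extend the displacement by hand to stay portable. */
    int64_t rel = (raw & 0x80000000u) ? (int64_t)raw - 0x100000000LL
                                      : (int64_t)raw;

    return (uint64_t)((int64_t)(call_addr + 5) + rel);
}
```

Called with 0x492dfc and the bytes e8 cf e2 f6 ff, this yields 0x4010d0. Called with e8 cf e2 f6 00 (the first four bytes of the new call followed by the last byte of the old NOP), it yields 0x14010d0, which matches the junk address. The return address a call pushes is the address of the next instruction, here 0x492dfc + 5 = 0x492e01, which is why frame #1 shows that value.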

Any ideas what's going on here? Please let me know if more context is needed. By the way, I am on an Intel(R) Xeon(R) CPU E5-2670, but the behavior seems consistent across a couple of other machines I tried.

Edit: The code was compiled at the -O2 optimization level with the following additional options:

-fno-optimize-sibling-calls -finstrument-functions

chamibuddhika
  • The backtrace may be incorrect, or missing parts. What's the code at the correct address? – Jester Apr 14 '15 at 14:49
  • The disassembly I have given is the code around where the failure seems to happen, which is frame #1 in the backtrace. – chamibuddhika Apr 14 '15 at 14:57
  • My point is there may be any number of intermediate steps between the two frames that your backtrace shows. Maybe you ended up at the wrong address by going to the correct address first, but then going on. – Jester Apr 14 '15 at 15:00
  • I edited the question to include the full backtrace. Apart from the last frame, the others look clean. Hope that helps? And the disassembly is a portion of the FastPelY_14 function. – chamibuddhika Apr 14 '15 at 15:05
  • I mean `bt` might not be showing frames (or intermediate steps) **between #0 and #1**, so you might get there from `0x4010d0`. What I am asking is what code is there (at the correct address), if maybe that can subsequently go to the wrong location? – Jester Apr 14 '15 at 15:08
  • I see. You mean something like the stack getting corrupted, right? How can I get to that information if that's the case? – chamibuddhika Apr 14 '15 at 15:14
  • Just realized tail calls can be a reason for such a behavior. I failed to mention I compile my code with -fno-optimize-sibling-calls to disable this optimization for some other reason. – chamibuddhika Apr 14 '15 at 15:24
  • I just checked the registers for #0 and #1. They contain the same values except rip. Perhaps that's a good indication there were no intermediate steps involved? (Since if that's the case the register values would have been different due to perturbations from intermediate calls) – chamibuddhika Apr 14 '15 at 15:40
  • Since you're modifying code, success or failure depends on the depth of the prefetch queue and/or pipeline of your processor. It's possible that you modify the code too late in the process, when part of it is already in the pipeline. What happens if you add some dummy instructions between modification and call? – mfro Apr 14 '15 at 15:42
  • Hmm. From my (not so deep) understanding after reading some other questions, such as http://stackoverflow.com/questions/17395557/observing-stale-instruction-fetching-on-x86-with-self-modifying-code?lq=1, I was under the impression that such cases should be handled properly on x86. – chamibuddhika Apr 14 '15 at 16:30
  • Intel manual (Vol 3A. 11.6) says "If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a snoop of an instruction in a code segment, where the target instruction is already decoded and resident in the trace cache, invalidates the entire trace cache." My interpretation is that it might be costly but safe. – chamibuddhika Apr 14 '15 at 21:13
  • Are you able to reproduce it with the code in your previous question? I have changed the instruction to `0x0000441F0F000000` as per your post, and it seems to run properly for an extended amount of time on my amd fx8350. – Jester Apr 14 '15 at 21:24
  • @chamibuddhika - Invalidating the trace cache can indeed be very costly. Intel dropped the trace cache with the Pentium 4/Netburst microarchitecture. However, in recent processors it added a micro-op cache, so running self-modifying code may again be costly. – Craig S. Anderson Apr 20 '15 at 07:07
  • @CraigS.Anderson: self-modifying code is extremely slow on every recent design (P6, P4, AMD). The whole out-of-order pipeline switches to a special mode that doesn't fetch ahead, when it detects a write to an address near the instruction pointer. Or something like that. It was slow even on PPro-derived cores before Sandybridge introduced the uop cache. – Peter Cordes Jul 14 '15 at 01:12
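
One check that may be worth making on the patch site, offered as a hypothesis rather than a confirmed diagnosis: the 5-byte call at 0x492dfc occupies 0x492dfc through 0x492e00, so its last byte falls on the far side of a 64-byte boundary. The sketch below tests for that. The 64-byte cache-line size is an assumption (it is not stated in the question), and `crosses_boundary` and `bytes_before_boundary` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the byte range [addr, addr + len) spans a multiple of
 * `boundary`, which must be a power of two; returns 0 otherwise. */
int crosses_boundary(uint64_t addr, size_t len, uint64_t boundary)
{
    assert(len > 0 && boundary != 0 && (boundary & (boundary - 1)) == 0);

    uint64_t first_block = addr & ~(boundary - 1);
    uint64_t last_block  = (addr + len - 1) & ~(boundary - 1);
    return first_block != last_block;
}

/* Number of bytes of the range that lie before the next boundary
 * (equal to len when the range does not cross one). */
size_t bytes_before_boundary(uint64_t addr, size_t len, uint64_t boundary)
{
    assert(len > 0 && boundary != 0 && (boundary & (boundary - 1)) == 0);

    uint64_t room = boundary - (addr & (boundary - 1));
    return room < len ? (size_t)room : len;
}
```

For the call site in the question, `crosses_boundary(0x492dfc, 5, 64)` returns 1 and `bytes_before_boundary(0x492dfc, 5, 64)` returns 4: the opcode and the first three displacement bytes sit in one 64-byte block and the final displacement byte sits in the next. That split lines up with the observed e8 cf e2 f6 00 pattern, where only the last byte looks stale, but it does not by itself prove the cause.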

0 Answers