Is this "should not happen" crash an AMD Fusion CPU bug?

Question

My company has started having a number of customers call in because our program is crashing with an access violation on their systems.

The crash happens in SQLite 3.6.23.1, which we ship as part of our application. (We ship a custom build, in order to use the same VC++ libraries as the rest of the app, but it's the stock SQLite code.)

The crash happens when pcache1Fetch executes call 00000000, as shown by the WinDbg callstack:

0b50e5c4 719f9fad 06fe35f0 00000000 000079ad 0x0
0b50e5d8 719f9216 058d1628 000079ad 00000001 SQLite_Interop!pcache1Fetch+0x2d [sqlite3.c @ 31530]
0b50e5f4 719fd581 000079ad 00000001 0b50e63c SQLite_Interop!sqlite3PcacheFetch+0x76 [sqlite3.c @ 30651]
0b50e61c 719fff0c 000079ad 0b50e63c 00000000 SQLite_Interop!sqlite3PagerAcquire+0x51 [sqlite3.c @ 36026]
0b50e644 71a029ba 0b50e65c 00000001 00000e00 SQLite_Interop!getAndInitPage+0x1c [sqlite3.c @ 40158]
0b50e65c 71a030f8 000079ad 0aecd680 071ce030 SQLite_Interop!moveToChild+0x2a [sqlite3.c @ 42555]
0b50e690 71a0c637 0aecd6f0 00000000 0001edbe SQLite_Interop!sqlite3BtreeMovetoUnpacked+0x378 [sqlite3.c @ 43016]
0b50e6b8 71a109ed 06fd53e0 00000000 071ce030 SQLite_Interop!sqlite3VdbeCursorMoveto+0x27 [sqlite3.c @ 50624]
0b50e824 71a0db76 071ce030 0b50e880 071ce030 SQLite_Interop!sqlite3VdbeExec+0x14fd [sqlite3.c @ 55409]
0b50e850 71a0dcb5 0b50e880 21f9b4c0 00402540 SQLite_Interop!sqlite3Step+0x116 [sqlite3.c @ 51744]
0b50e870 00629a30 071ce030 76897ff4 70f24970 SQLite_Interop!sqlite3_step+0x75 [sqlite3.c @ 51806]

The relevant line of C code is:

if( createFlag==1 ) sqlite3BeginBenignMalloc();

The compiler inlines sqlite3BeginBenignMalloc, which is defined as:

typedef struct BenignMallocHooks BenignMallocHooks;
static SQLITE_WSD struct BenignMallocHooks {
  void (*xBenignBegin)(void);
  void (*xBenignEnd)(void);
} sqlite3Hooks = { 0, 0 };

# define wsdHooksInit
# define wsdHooks sqlite3Hooks

SQLITE_PRIVATE void sqlite3BeginBenignMalloc(void){
  wsdHooksInit;
  if( wsdHooks.xBenignBegin ){
    wsdHooks.xBenignBegin();
  }
}

And the assembly for this is:

719f9f99    mov     esi,dword ptr [esp+1Ch]
719f9f9d    cmp     esi,1
719f9fa0    jne     SQLite_Interop!pcache1Fetch+0x2d (719f9fad)
719f9fa2    mov     eax,dword ptr [SQLite_Interop!sqlite3Hooks (71a7813c)]
719f9fa7    test    eax,eax
719f9fa9    je      SQLite_Interop!pcache1Fetch+0x2d (719f9fad)
719f9fab    call    eax ; *** CRASH HERE ***
719f9fad    mov     ebx,dword ptr [esp+14h]

The registers are:

eax=00000000 ebx=00000001 ecx=000013f0 edx=fffffffe esi=00000001 edi=00000000
eip=00000000 esp=0b50e5c8 ebp=000079ad iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202

If eax is 0 (which it is), the zero flag should be set by test eax, eax, but it's non-zero. Because the zero flag isn't set, je doesn't jump, and then the app crashes trying to execute call eax (00000000).

Update: eax should always be 0 here because sqlite3Hooks.xBenignBegin is not set in our build of the code. I could rebuild SQLite with SQLITE_OMIT_BUILTIN_TEST defined, which would turn on #define sqlite3BeginBenignMalloc() in the code and omit this code path entirely. That may solve the issue, but it doesn't feel like a "real" fix; what would stop it happening in some other code path?

So far the common factor is that all customers are running "Windows 7 Home Premium 64-bit (6.1, Build 7601) Service Pack 1" and have one of the following CPUs (according to DxDiag):

AMD A6-3400M APU with Radeon(tm) HD Graphics (4 CPUs), ~1.4GHz
AMD A8-3500M APU with Radeon(tm) HD Graphics (4 CPUs), ~1.5GHz
AMD A8-3850 APU with Radeon(tm) HD Graphics (4 CPUs), ~2.9GHz

According to Wikipedia's AMD Fusion article, these are all "Llano" model AMD Fusion chips based on the K10 core and were released in June 2011, which is when we first started getting reports.

The most common customer system is the Toshiba Satellite L775D, but we also have crash reports from HP Pavilion dv6 & dv7 and Gateway systems.

Could this crash be caused by a CPU error (see Errata for AMD Family 12h Processors), or is there some other possible explanation that I'm overlooking? (According to Raymond, it could be overclocking, but it's odd that just this specific CPU model is affected, if so.)

Honestly, it doesn't seem possible that it's really a CPU or OS error, because the customers aren't getting bluescreens or crashes in other applications. There must be some other, more likely, explanation--but what?

Update 15 August: I've acquired a Toshiba L745D notebook with an AMD A6-3400M processor and can reproduce the crash consistently when running the program. The crash is always on the same instruction; .time reports anywhere from 1m30s to 7m of user time before the crash. One fact (that may be pertinent to the issue) that I neglected to mention in the original post is that the application is multi-threaded and has both high CPU and I/O usage. The application spawns four worker threads by default and posts 80+% CPU usage (there is some blocking for I/O as well as for mutexes in the SQLite code) until it crashes. I modified the application to only use two threads, and it still crashed (although it took longer to happen). I'm now running a test with just one thread, and it hasn't crashed yet.

Note also that it doesn't appear to be purely a CPU load problem; I can run Prime95 without errors on the system and it will boost the CPU temperature to >70°C, while my application barely gets the temperature above 50°C while it's running.

Update 16 August: Perturbing the instructions slightly makes the problem "go away". For eaxmple, replacing the memory load (mov eax,dword ptr [SQLite_Interop!sqlite3Hooks (71a7813c)]) with xor eax, eax prevents the crash. Modifying the original C code to add an extra check to the if( createFlag==1 ) statement changes the relative offsets of various jumps in the compiled code (as well as the location of the test eax, eax and call eax statements) and also seems to prevent the problem.

The strangest result I've found so far is that changing the jne at 719f9fa0 to two nop instructions (so that control always falls through to the test eax, eax instruction, no matter what the value of createFlag/esi is) allows the program to run without crashing.

Just a note: there's always the possibility that some *other* piece of code is jumping to that instruction. It's not likely, but it's worth keeping in mind that the previous instruction doesn't *have* to be `je`. (And wow... I +1'd this, refreshed, and it went up by 4 more! :D) — user541686, Aug 10 '11 at 00:51
It's almost certainly not a CPU bug. Have you considered making a simpler test case? — Oliver Charlesworth, Aug 10 '11 at 00:51
@Mehrdad: Yes, some code could just jump there, but it did a really good job of faking the callstack if so. — Bradley Grainger, Aug 10 '11 at 01:01
@Oli Charlesworth: I know that it's always your own code and *never* the compiler, and *never* the OS, and *never* the CPU ... but this one's just really weird. I have yet to get a AMD A6 system for testing--this dump was acquired from a customer's machine. Once I get a machine and can reproduce locally, making a simpler test case will be a high priority. — Bradley Grainger, Aug 10 '11 at 01:03
@Bradley: A `jmp`/`je`/etc. wouldn't show anything on the call stack -- it could just be a compiler optimization issue with a branch it thought would never be taken, due to undefined behavior or something. — user541686, Aug 10 '11 at 01:06
I have to agree with Oli. It would be extremely unlikely for an issue as fundamental as `test` setting incorrect flags to not be caught in internal QA testing. Especially since this `test then jump` operation seems to be a very common compiler optimization that is used in a great number of programs. — aroth, Aug 10 '11 at 01:11
@Mehrdad I understand that a `jmp`/`jcc` won't affect the callstack. I'm assuming (maybe incorrectly?) that `sqlite3PcacheFetch` called `pcache1Fetch` (as shown by the callstack) and that `pcache1Fetch` began executing and hasn't yet returned to its caller. The crashing code is just 0x2B bytes into the method (almost right after the prologue), and nothing later in the method jumps back to that address--all loops in the method are after this. My reading of the disassembly indicates that the only way to normally reach this instruction is linear execution from the beginning of the method. — Bradley Grainger, Aug 10 '11 at 01:48
@aroth: I want to agree with Oli, too: it really does seem impossible that this would be a CPU error. My SO question title was deliberately a little provocative to get attention... I certainly don't mean to imply that my first instinct is to blame the CPU, OS, or compiler. As Oli pointed out, a simpler test case is really called for here. — Bradley Grainger, Aug 10 '11 at 01:52
does this same problem occur if you force `sqlite3BeginBenignMalloc` to not be inline (via `__declspec(noinline)` or `__attribute__((no_inline))`? (if anything, not inlining the method may make it a little easier to debug). — Necrolis, Aug 10 '11 at 07:05
@Necrolis: I plan to try to answer questions like that once I get an A6 system for testing. (Another change to test would be removing the method entirely, since it's actually always a no-op.) — Bradley Grainger, Aug 10 '11 at 14:16
I just wanted to chime in and say that was a very well written question. +1 — gahooa, Aug 11 '11 at 20:52
I am not sure, but when it says Windows-64, why the hell are the addresses (and EAX too) seem to indicate that it is a 32 bit system (I am no windows guy so maybe it runs all user programs in 32 bit mode, which I would found for a 64 bit system strange)? — flolo, Aug 15 '11 at 17:49
@flolo: This is a 32-bit process running on 64-bit Windows (WOW64); this output is normal. — Bradley Grainger, Aug 15 '11 at 22:35

Bradley Grainger · Accepted Answer · 2014-09-27T04:22:16.130

I spoke to an AMD engineer at the Microsoft Build conference about this error, and showed him my repro. He emailed me this morning:

We have investigated and found that this is due to a known errata in the Llano APU family. It can be fixed via a BIOS update depending on the OEM – if possible please recommend it to your customers (even though you have a workaround).

In case you’re interested, the errata is 665 in the Family 12h Revision Guide (see page 45): http://support.amd.com/TechDocs/44739_12h_Rev_Gd.pdf#page=45

Here's the description of that erratum:

665 Integer Divide Instruction May Cause Unpredictable Behavior

Description

Under a highly specific and detailed set of internal timing conditions, the processor core may abort a speculative DIV or IDIV integer divide instruction (due to the speculative execution being redirected, for example due to a mispredicted branch) but may hang or prematurely complete the first instruction of the non-speculative path.

Potential Effect on System

Unpredictable system behavior, usually resulting in a system hang.

Suggested Workaround

BIOS should set MSRC001_1029[31].

This workaround alters the DIV/IDIV instruction latency specified in the Software Optimization Guide for AMD Family 10h and 12h Processors, order# 40546. With this workaround applied, the DIV/IDIV latency for AMD Family 12h Processors are similar to the DIV/IDIV latency for AMD Family 10h Processors.

Fix Planned

No

The "665 Integer Divide" problem is discussed in the Passmark forums: http://www.passmark.com/forum/showthread.php?3656-AMD-llano-A-series-benchmark-and-CPU-bug A comment there says the problem only happens with dual-channel RAM. So a computer with one 4GB RAM stick and a Llano CPU will probably be fine without the BIOS fix. But if you spend $20 to upgrade to 8 GB, you'll have problems -- which you'll probably (wrongly!) blame on the RAM. Unfortunately, the BIOS "fix" causes a >80% slowdown in Passmark's integer math benchmark, and a >30% reduction in the Passmark score. — Dave Burton, Jun 25 '12 at 02:30

wallyk · Answer 2 · 2011-08-10T07:03:20.220

1

I'm a bit concerned that the code generated for if (wsdHooks.xBenignBegin) isn't very general. It assumes the only true value is 1 whereas it should really be testing for any nonzero value. Still, MSVC is sometimes baffling that way. It is probably nothing. Never mind: these instructions are for C code not presented.

Given that the eflag Z bit is clear and EAX is zero, the code did not get here by executing the instruction

719f9fa7    test    eax,eax

There must be a jump from somewhere else to the instruction following (719f9fa9 je SQLite_Interop!pcache1Fetch+0x2d) or even the call instruction itself.

Another complication is that with the x86 family, it is common for an invalid jump target (like the second byte of the JE instruction) to execute unperturbed (no faults) for quite a few instructions, often eventually getting back on the proper instruction alignment. Said another way, you may not be looking for a jump to the beginning of any of these instructions: a jump might be in the midst of their bytes, resulting in executing unremarkable operations like add [al+ebp],al which tend not to be noticed.

I predict that a breakpoint at the test instruction will not be hit for the exception. The only ways to find such causes is either to be very lucky, or to suspect everything and prove them innocent one-by-one.

edited Aug 10 '11 at 07:03

answered Aug 10 '11 at 06:50

wallyk

56,922
16
83
148

Regarding your first paragraph: `test` only sets ZF if `eax & eax` equals zero, so it's pretty safe with the subsequent `je`. – Michael Foukarakis Aug 10 '11 at 06:59
you asm analysis is off, the check vs 1 is because the C code is checking vs 1, as its for the line `if( createFlag==1 ) sqlite3BeginBenignMalloc();` not `if (wsdHooks.xBenignBegin)` (see the OP's comment on `sqlite3BeginBenignMalloc` being inlined) – Necrolis Aug 10 '11 at 07:01
@Michael Foukarakis: a fair point so I've redacted my comment. – wallyk Aug 10 '11 at 07:04
2

I'm not trying to be argumentative, but I think your theory is contradicted by my finding that replacing the `jne` before the `test` with `nop` instructions seemingly prevents the crash. (100% reproducible without that change, 0% reproducible in one day of testing with it.) If some other instruction is jumping into the middle of the `je` or directly to the `call`, it wouldn't be affected by that change. Also, how does the theory that some other code is jumping to the `je` or the `call` explain that this only happens on Llano APUs? – Bradley Grainger Aug 17 '11 at 04:55

score -1 · Answer 3 · answered Aug 15 '11 at 17:23

-1

Before considering the possibility of a CPU bug, try to rule out the more probable causes

A different code path to the call instruction. Use the uf command to disassemble the function and look for other jumps / branches to the call instruction
Jump / call to 0 from the hook function. dps SQLite_Interop!sqlite3Hooks l 2 and verify that it shows nulls.

answered Aug 15 '11 at 17:23

John

5,561
1
23
39

1. (I replied with this in a comment already, but didn't update the original question, so it wasn't obvious; my apologies.) The crashing code is 0x2B bytes into the function (almost right after the prologue). I unassembled the whole function and there are no jumps back that early; this statement occurs before the loops in the function body. A jump from outside of the function is possible, of course, but is hard to reconcile with the stack. 2. `dps SQLite_Interop!sqlite3Hooks l 2` shows `00000000 00000000`. – Bradley Grainger Aug 15 '11 at 22:37
I also set a data breakpoint (`ba w 4 SQLite_Interop!sqlite3Hooks`) and there were no writes to this address before the crash. – Bradley Grainger Aug 15 '11 at 22:51