Recently I have been seeing behavior in my embedded software that looks strange to me.

What I have: a 32-bit AVR32 controller that starts the program from external SDRAM, because the image is too big to run directly from the microcontroller's flash. According to the physical memory map, the memory areas are split as follows:

stack (starts at 0x1000, length 0xF000; addresses below 0x1000 are protected by the MPU)

EBI SDRAM (start at 0xD0000000, length of 0x00400000).


What happens: Unfortunately I get an exception that I cannot reproduce reliably. According to my stack trace, the following event occurs at irregular intervals:

Name: Bus error data fetch - Event source: Data bus - Stored Return Address: First non-completed instruction

In addition, the stack pointer has a valid value, whereas the address where the exception occurs (the last entry point for fetching instructions) points to nowhere in the memory map (e.g. 0x496e6372, or something around 0x5..., 0x6....). I guess this is the "First non-completed instruction" the manual is talking about. However, the line in my source code is always the same: calling a member function through a pointer stored in a data array.

      if(mSomeArray[i])
      {
         mSomeArray[i]->someFunction(); // <-- crash
      }

The thing is: adding or removing unrelated source code makes the event disappear and then come back.
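
One cheap check on a bogus address like this is whether its bytes happen to be printable ASCII. 0x496e6372 decodes to the characters `Incr`, which would be consistent with string data having been written over a pointer, though it does not prove it. A minimal sketch of that check follows; the helper name `addr_as_ascii` is made up for illustration, and the byte order shown (most significant byte first) matches memory order only on a big-endian core.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the four bytes of a 32-bit value, most
 * significant byte first, into out[0..3] and NUL-terminates out[4].
 * Non-printable bytes are replaced by '.'.
 * Returns 1 if all four bytes are printable ASCII, otherwise 0. */
static int addr_as_ascii(uint32_t addr, char out[5])
{
    int all_printable = 1;

    assert(out != NULL);
    for (int k = 0; k < 4; ++k) {
        unsigned char c = (unsigned char)(addr >> (24 - 8 * k));
        if (isprint(c)) {
            out[k] = (char)c;
        } else {
            out[k] = '.';
            all_printable = 0;
        }
    }
    out[4] = '\0';
    return all_printable;
}
```

If the exception handler prints this next to the raw address, text that leaked into a pointer becomes visible at a glance.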


What I thought about: Something is corrupting my memory (or its mapping). What kinds of errors could cause this?

  • A buffer overflow?
  • The SDRAM controller could be turned off, so it loses some data. That is not impossible, but rather improbable
  • The stack is big enough, I already checked this with a watermark
  • The Data Bus Rate and AVR clock are set correctly

How to solve this: More asserts? Unfortunately I cannot debug this with AVRStudio. Does anyone have a hint or an idea? Or am I missing something obvious?
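
Without a debugger, one assert-based option is to place guard words directly before and after the pointer table and check them before every use. The sketch below assumes this layout; `GUARD_WORD`, `SLOT_COUNT` and `guarded_table` are illustrative names, not taken from the original code. It only catches writes that run over the edges of the table, not a stray write into the middle of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_WORD 0xA5A5A5A5u  /* arbitrary, easy-to-spot pattern */
#define SLOT_COUNT 8            /* assumed table size */

/* Pointer table framed by two guard words. */
struct guarded_table {
    uint32_t guard_before;
    void    *slots[SLOT_COUNT];
    uint32_t guard_after;
};

/* Sets both guard words and clears every slot. */
static void guarded_table_init(struct guarded_table *t)
{
    assert(t != NULL);
    t->guard_before = GUARD_WORD;
    for (size_t k = 0; k < SLOT_COUNT; ++k) {
        t->slots[k] = NULL;
    }
    t->guard_after = GUARD_WORD;
}

/* Returns 1 while both guard words still hold their pattern, else 0. */
static int guarded_table_intact(const struct guarded_table *t)
{
    return t->guard_before == GUARD_WORD && t->guard_after == GUARD_WORD;
}
```

Calling `assert(guarded_table_intact(&table))` at the start of each loop over the table narrows down the time window in which the corruption happened.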


Edit:

Mentioned approaches from users:

  • Check for addresses of function pointer and array entries
  • Overwrite of stack array
  • Not properly written interrupts
  • Not initialized pointers
  • Check for array access via i at crash case
  • Use the exception handler address for illegal memory accesses
  • Use snprintf instead of sprintf
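
The last point can be sketched as follows: `snprintf` takes the buffer size and never writes past it, and its return value shows whether the output was truncated. The helper `format_entry` and its `name=value` format are assumptions for illustration only.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: formats "<name>=<value>" into buf without ever
 * writing more than buf_size bytes (including the terminating NUL).
 * Returns 1 if the whole text fit, 0 if it was truncated or an
 * encoding error occurred. */
static int format_entry(char *buf, size_t buf_size, const char *name, int value)
{
    assert(buf != NULL && buf_size > 0);

    /* snprintf returns the length the full text would have had. */
    int n = snprintf(buf, buf_size, "%s=%d", name, value);
    return n >= 0 && (size_t)n < buf_size;
}
```

With plain `sprintf`, the same call with a name that is too long would silently overwrite whatever lies behind the buffer, for example a neighboring array of pointers.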

Late appendix to the thread: the issue was a wrong array access (a wrong index was set) in an old software module that had nothing to do with my modules. I found it by accident; it is curious that it did not show up earlier, and it took me quite a while to find the line of code. I am marking the only given answer as the correct solution.

Thank you all for your input.

Take care (of your software ;))

Pierre Begon
  • If you call that function earlier in your code is the Bus Error triggered? – LPs Sep 22 '15 at 09:11
  • @LPs: no. It is possible, that the function can be called n times and does not crash. At any time, the exception can occur. That is why I can not reproduce it properly. – Pierre Begon Sep 22 '15 at 09:18
  • Have you got a console in your application where you can output data (printf)? – LPs Sep 22 '15 at 09:27
  • Yep, that is where I got my stack trace with the information I mentioned above. I got a clear stack trace with the instruction registers, but the values come out of nowhere. – Pierre Begon Sep 22 '15 at 09:32
  • You can store the relevant data each time that function is called and then, after the bus error, retrieve it on the console. Or, if too many printf calls do not block your system, output the info each time the function is called and then look at the last info printed before the bus error. Very old-style debugging. – LPs Sep 22 '15 at 09:35
  • What is the address where the function pointer points when the bug strikes? Can you see anything strange in other members of the function pointer's struct when this happens? The line `if(mSomeArray[i])` is fishy, it suggests that you actually don't quite know where that pointer is pointing, which in turn makes any code handling that array of pointers the most likely culprit. – Lundin Sep 22 '15 at 09:40
  • C does not support exceptions, so using the tag here is questionable (the description only refers to programming languages, not hardware exceptions). However, `AVR` is related to the smaller 8 bit MCUs. AVR32 has a completely different architecture. – too honest for this site Sep 22 '15 at 09:59
  • @LP: I can do this type of debugging. However, I cannot see where these wrong memory values come from. – Pierre Begon Sep 22 '15 at 10:04
  • @Lundin: good point about the function pointer and the "surrounding" members. The if statement just checks for NULL pointers, as those entries are in a defined state and are not handled by the mentioned function. How can this be tricky? – Pierre Begon Sep 22 '15 at 10:08
  • @Olaf: I agree with the c tag. The intention was to mention the language due to the buffer overflow or something like that. Thanks for changing the AVR tag. – Pierre Begon Sep 22 '15 at 10:13
  • The key point is to store the inputs and the relevant variables each time you call the problematic function. Once you have that info you can start tracing back to the origin of the problem. – LPs Sep 22 '15 at 10:20
  • @buffo: Sorry, my comment was about the "exception" tag actually. I first removed that with the AVR-tag, but then re-added (and created a new avr32 tag, you could not have found that before), because it actually _is_ an exception. Actual problem is that the same word can have very different meanings. – too honest for this site Sep 22 '15 at 11:25
  • -1 There are so many ways to corrupt memory, this question is too vague. Chances are you are overwriting an array on the stack. Or your interrupts are not written properly. Or your pointers are not initialized or... The only way to debug this is step through each line of code with a debugger, or spend days simulating the code in your head to find your bug. – Mark Lakata Sep 23 '15 at 06:28
  • assuming that the memory is not corrupted (an iffy assumption) the most likely source of the problem is not enough 'wait' states being used for the external RAM. – user3629249 Sep 23 '15 at 15:27
  • assuming the problem originates in the data, what is the value of 'i' at the crash? is the function pointer selected by 'i' pointing to the correct function address? – user3629249 Sep 23 '15 at 15:37
  • @MarkLakata: I agree. After other investigations I can see, that it is clearly a software bug. I think further work on this will lead to a stack corruption due to an array overflow. – Pierre Begon Sep 24 '15 at 11:17
  • @user3629249: I can see that the memory is corrupted, due to the pointer I get, pointing to a wrong address (wrong as: not a suitable area of memory). The 'i' value is a valid one. – Pierre Begon Sep 24 '15 at 11:19
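
The "store the data on every call, read it back after the bus error" suggestion from the comments can be sketched as a small ring buffer that the exception handler dumps to the console. This is only an illustration under assumptions: `trace_record`, `trace_get`, `TRACE_DEPTH` and the choice of recording the index and the target address are not from the original code.

```c
#include <assert.h>
#include <stdint.h>

#define TRACE_DEPTH 16u  /* number of calls kept; a power of two */

/* One record per call: the array index and the pointer value used. */
struct trace_entry {
    uint32_t  index;
    uintptr_t target;
};

static struct trace_entry trace_buf[TRACE_DEPTH];
static uint32_t trace_count;  /* total number of records written */

/* Call this right before the suspicious line; overwrites the oldest record. */
static void trace_record(uint32_t index, uintptr_t target)
{
    trace_buf[trace_count % TRACE_DEPTH].index = index;
    trace_buf[trace_count % TRACE_DEPTH].target = target;
    ++trace_count;
}

/* Returns the record written `age` calls ago (0 = most recent). */
static struct trace_entry trace_get(uint32_t age)
{
    assert(age < TRACE_DEPTH && age < trace_count);
    return trace_buf[(trace_count - 1u - age) % TRACE_DEPTH];
}
```

Recording costs two stores and an increment per call, so it is far less intrusive than a printf on every call, and the last few records show which index and address were in use when the bus error hit.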

1 Answer

Here are some ideas:

  1. Check 'i' to make sure it is within the array bounds.
  2. Check the address of the function pointer that is about to be called. It should have an address within the SDRAM.
  3. See if the chip has an exception handler address it will jump to when it accesses illegal memory. Once you are there, output some debug data
  4. If your debugger allows, set a breakpoint on someFunction() when it is written. This would catch some other function when it overwrites the function pointer.
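
Points 1 and 2 can be combined into one check that runs right before the call. The sketch below uses the SDRAM range given in the question (start 0xD0000000, length 0x00400000); the function names are hypothetical, and the assumption is that valid call targets live in SDRAM because the program runs from there. Data objects may legitimately live in other regions, so the range test would need adapting for them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address range of the external SDRAM as stated in the question. */
#define SDRAM_BASE 0xD0000000u
#define SDRAM_SIZE 0x00400000u

/* Returns 1 if addr lies inside [SDRAM_BASE, SDRAM_BASE + SDRAM_SIZE). */
static int in_sdram(uintptr_t addr)
{
    return addr >= SDRAM_BASE && (addr - SDRAM_BASE) < SDRAM_SIZE;
}

/* Returns 1 if index i is within the table and the target address is
 * plausible, otherwise 0. */
static int slot_is_callable(size_t i, size_t count, uintptr_t target)
{
    return i < count && in_sdram(target);
}

/* Debug-build variant: stop at the call site instead of crashing later. */
static void assert_slot_callable(size_t i, size_t count, uintptr_t target)
{
    assert(slot_is_callable(i, count, target));
}
```

A failing check at least turns the sporadic bus error into a deterministic stop with the bad index or address at hand, which can then be printed from the handler.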
Robert Deml
  • Thanks for the further input. As already noted: 'i' is a valid value, the function pointer address is invalid. Unfortunately I cannot use a debugger (as mentioned in the initial post). Further investigation of my exception handler could be a useful approach. – Pierre Begon Sep 24 '15 at 11:20