I have detected data corruption in RAM on an STM32-based device.

The issue is very rare and not reproducible. In one device, I have seen only a single-bit corruption, but it causes a lot of damage. From the behavior of other corrupted devices, I understand that similar, but not identical, data corruption occurred in the same region.

The corrupted value belongs to a static global array, which isn't modified in run time. From the Map file, I can see the array is placed in RAM base address.

I want to find the root cause of this corruption. What would be the most efficient way to go about it?

The project is developed using IAR EW for ARM, and I am using an ST-Link V2 debugger.

  • "The issue is very rare and not reproducible": that's the worst kind, and you may never find a cause for it. A single-bit corruption, especially, can be caused by software or hardware issues. In the latter case, it can be a defect in the chip, a radioactive atom disintegrating nearby, a cosmic-ray remnant, an electromagnetic wave... Just be aware of that – SteffX Aug 24 '23 at 12:40
  • Because you can never be sure that RAM holds its content all the time, safety-related software needs to implement safety measures against corruption. Even if you took all the measures that can seriously be expected in the first place (clean supply, operation within specified limits, protection against radiation, you name it), you cannot do anything about the very small probability that is left over, and you need to plan for corruption. – the busybee Aug 24 '23 at 13:51
  • If you can reproduce it frequently then it's likely caused by a bug. In that case you would hook up a debugger with trace and set a write breakpoint on the cell. When a value gets written there, you view the trace to see who's the culprit. – Lundin Aug 24 '23 at 14:09
  • The simplest thing to do is to place a Data Watchpoint (data breakpoint) on as much of the data in question as possible. Check your toolchain for the size of the Data Watchpoint, or for the mask it applies to the address. A more complex method would involve the MPU. – wek Aug 24 '23 at 14:10
  • In case it's caused by hardware or cosmic rays, well... here's a check list of things you can do: https://stackoverflow.com/a/36892379/584518 – Lundin Aug 24 '23 at 14:10
  • Is it possible you've got a software bug writing to the bit-banding memory region that is mapped to your array? Does it happen on multiple devices? Is it the same place all the time? – Russ Schultz Aug 24 '23 at 15:00
  • It's not clear from the question, but could this array be stored in flash instead of RAM? You said "a static global array, which isn't modified in run time". – pmacfarlane Aug 24 '23 at 19:09
  • do you have your flash wait states set correctly for the clock speed and power? – old_timer Aug 24 '23 at 23:26
  • @pmacfarlane I have stated in the question- "From the Map file, I can see the array is placed in RAM base address." – Monem Ahmed Aug 25 '23 at 03:21
  • @old_timer, I have to check that. Can you explain a bit about how it may affect the RAM data? The data that gets corrupted is loaded in the RAM at startup and never modified afterward. – Monem Ahmed Aug 25 '23 at 06:06
  • I am asking why is it stored in RAM? If it never changes, it should probably be stored in flash. Make it `const`. – pmacfarlane Aug 25 '23 at 07:28
  • @pmacfarlane Yes, that should have been done. I will do that, and it might well fix this particular issue. However, I am trying to find the root cause so that I can guard against any other RAM corruption issue in the future. – Monem Ahmed Aug 25 '23 at 08:14
  • just a random thought: if your device has MPU, you can control RAM access at various moments of program execution. I would consider trying to lock various regions of RAM and see when illegal access exception happens. You can extract some useful data about where and how the error happened in the exception handler. It's not a silver bullet, there is no guarantee that you find it, but worth a shot. – Ilya Aug 26 '23 at 15:08
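
On the bit-banding question raised in the comments: on Cortex-M3/M4 parts that implement bit-banding, every bit in the first 1 MB of SRAM (starting at 0x20000000) also has a word-sized alias in the region starting at 0x22000000, so a stray write to an alias address flips exactly one bit. The helper below is only a sketch that computes the alias address for a given corrupted byte and bit, so the map file and the code can be searched for pointers that land there. The function and macro names are made up for illustration, and the thread does not say which STM32 is used, so whether bit-banding applies at all is an assumption to check against the reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* Cortex-M3/M4 SRAM bit-band layout (assumed to apply to the part in use). */
#define SRAM_BASE_ADDR     0x20000000u  /* start of the bit-band region       */
#define SRAM_BB_ALIAS_BASE 0x22000000u  /* start of the bit-band alias region */
#define SRAM_BB_REGION_SZ  0x00100000u  /* only the first 1 MB is aliased     */

/* Hypothetical helper: word address in the alias region that maps to
 * bit `bit` (0..7) of the SRAM byte at `byte_addr`. */
static uint32_t sram_bitband_alias(uint32_t byte_addr, unsigned bit)
{
    assert(byte_addr >= SRAM_BASE_ADDR);
    assert(byte_addr - SRAM_BASE_ADDR < SRAM_BB_REGION_SZ);
    assert(bit < 8u);

    /* alias = alias_base + byte_offset * 32 + bit_number * 4 */
    return SRAM_BB_ALIAS_BASE + (byte_addr - SRAM_BASE_ADDR) * 32u + bit * 4u;
}
```

For an array at the RAM base address, bit 0 of its first byte maps to alias address 0x22000000, which is an easy value for a miscomputed pointer to hit.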

2 Answers

Duplicate your data and scatter verification code such as `assert(0==memcmp(&Data, &Copy, sizeof(Data)));` throughout your code. This is a low-level approach that takes a lot of time, but it can help narrow down where the memory corruption takes place. Seriously, I once localized a memory corruption with ugly code like `assert(*(int*)0x123==456);`.
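
A minimal sketch of this approach might look like the following. The contents of `Data` and `Copy` are placeholders, and `data_first_mismatch` and `CHECK_DATA` are hypothetical names introduced here; the reference copy is declared `const` on the assumption that the linker then places it in flash, out of reach of the stray write.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder contents; substitute the real table. */
static uint32_t Data[4] = { 0x11u, 0x22u, 0x33u, 0x44u };        /* RAM copy under suspicion   */
static const uint32_t Copy[4] = { 0x11u, 0x22u, 0x33u, 0x44u };  /* reference, typically flash */

/* Returns the index of the first word that differs from the reference,
 * or -1 if the RAM copy is still intact. */
static int data_first_mismatch(void)
{
    for (size_t i = 0; i < sizeof(Data) / sizeof(Data[0]); i++) {
        if (Data[i] != Copy[i]) {
            return (int)i;
        }
    }
    return -1;
}

/* Scatter this between suspect calls; the first assert that fires
 * brackets the code that ran since the previous passing check. */
#define CHECK_DATA() assert(0 == memcmp(Data, Copy, sizeof(Data)))
```

Placing `CHECK_DATA();` before and after each suspect function call, and then moving the checks inward, gradually narrows the window in which the corruption happens. `data_first_mismatch()` can be logged in a fault handler to record which element was hit.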

user5329483

static global array, which isn't modified in run time.

Place it in flash, where it will not be modified.
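
As a rough illustration, declaring the array `const` is usually enough for a Cortex-M toolchain to link it into a read-only section in flash; the map file should confirm this. The table name, element type, and values below are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table: `const` lets the linker keep it in flash
 * instead of copying it into RAM at startup. */
static const uint16_t lookup_table[4] = { 10u, 20u, 30u, 40u };

/* Read-only accessor with a bounds check. */
uint16_t lookup(unsigned idx)
{
    assert(idx < sizeof(lookup_table) / sizeof(lookup_table[0]));
    return lookup_table[idx];
}
```

Ordinary store instructions do not reprogram flash, so a stray pointer write can no longer silently change the table. The bug that caused the write is still there, though, and may now hit whatever ends up at the old RAM address.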

0___________