How to prevent "partial write" data corruption during power loss?

Question

In an embedded environment (using MSP430), I have seen some data corruption caused by partial writes to non-volatile memory. This seems to be caused by power loss during a write (to either FRAM or info segments).

I am validating data stored in these locations with a CRC.

My question is, what is the correct way to prevent this "partial write" corruption? Currently, I have modified my code to write to two separate FRAM locations. So, if one write is interrupted causing an invalid CRC, the other location should remain valid. Is this a common practice? Do I need to implement this double write behavior for any non-volatile memory?

Clifford · Accepted Answer · 2017-12-14T20:46:00.527

A simple solution is to maintain two versions of the data (in separate pages for flash memory): the current version and the previous version. Each version has a header comprising a sequence number and a word that validates the sequence number, for example simply the ones' complement of the sequence number:

---------
|  seq  |
---------
| ~seq  |
---------
|       |
| data  |
|       |
---------

The critical thing is that when the data is written the seq and ~seq words are written last.

On start-up you read the data that has the highest valid sequence number (accounting for wrap-around perhaps - especially for short sequence words). When you write the data, you overwrite and validate the oldest block.
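As a rough illustration of the scheme described above, here is a minimal sketch in C. It assumes a 16-bit sequence word, an 8-byte payload, and a plain RAM array (`nv_slot`) standing in for the two non-volatile blocks; the names `record_t`, `find_current` and `store` are made up for this example. Clearing the old header before rewriting a block is an extra step added here; it stands in for the page erase that flash would need anyway.

```c
#include <stdint.h>
#include <string.h>

#define DATA_SIZE 8u /* assumed payload size */

/* Hypothetical record layout matching the diagram: header words, then data. */
typedef struct {
    uint16_t seq;      /* sequence number */
    uint16_t seq_inv;  /* ones' complement of seq, validates the header */
    uint8_t  data[DATA_SIZE];
} record_t;

/* Stand-in for two non-volatile blocks (FRAM regions or two flash pages). */
record_t nv_slot[2];

static int record_valid(const record_t *r)
{
    return (uint16_t)~r->seq == r->seq_inv;
}

/* Wrap-aware test: is sequence number a newer than b? */
static int seq_newer(uint16_t a, uint16_t b)
{
    uint16_t d = (uint16_t)(a - b);
    return d != 0u && d < 0x8000u;
}

/* Index of the newest valid block, or -1 if neither block is valid. */
int find_current(void)
{
    int v0 = record_valid(&nv_slot[0]);
    int v1 = record_valid(&nv_slot[1]);
    if (v0 && v1) return seq_newer(nv_slot[1].seq, nv_slot[0].seq) ? 1 : 0;
    if (v0) return 0;
    if (v1) return 1;
    return -1;
}

/* Overwrite the oldest block; the header is written after the data. */
void store(const uint8_t *data)
{
    int cur = find_current();
    int target = (cur == 0) ? 1 : 0;
    uint16_t next = (cur < 0) ? 1u : (uint16_t)(nv_slot[cur].seq + 1u);
    record_t *dst = &nv_slot[target];

    dst->seq = 0u;                    /* invalidate the old header first */
    dst->seq_inv = 0u;
    memcpy(dst->data, data, DATA_SIZE);
    dst->seq = next;                  /* header last: an interrupted write */
    dst->seq_inv = (uint16_t)~next;   /* leaves seq and ~seq inconsistent  */
}
```

On real hardware the compiler and memory controller must not reorder these stores, so the writes would typically go through the device's own write routine rather than plain assignments.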

The solution you are already using is valid so long as the CRC is written last, but it lacks simplicity and imposes a CRC calculation overhead that may not be necessary or desirable.

On FRAM you have no concern about endurance, but this is an issue for Flash memory and EEPROM. In this case I use a write-back cache method: the data is maintained in RAM, and when it is modified a timer is started, or restarted if it is already running. When the timer expires, the data is written. This prevents burst-writes from thrashing the memory, and is useful even on FRAM since it minimises the software overhead of data writes.
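The write-back cache idea might look roughly like the sketch below. It is not the answerer's code: the tick-based countdown, the `FLUSH_DELAY` value, and the names `cache_set` and `cache_tick` are assumptions, and `nv_copy` is a RAM array standing in for the real non-volatile write. `cache_tick` is assumed to be called from a periodic timer.

```c
#include <stdint.h>
#include <string.h>

#define CFG_SIZE    4u /* assumed size of the cached data */
#define FLUSH_DELAY 3u /* assumed: quiet ticks required before a flush */

uint8_t  nv_copy[CFG_SIZE]; /* stand-in for the non-volatile copy */
unsigned nv_write_count;    /* counts simulated non-volatile writes */

static uint8_t  ram_copy[CFG_SIZE]; /* working copy held in RAM */
static uint16_t flush_countdown;    /* 0 means the timer is stopped */

/* Modify the RAM copy and start (or restart) the flush timer. */
void cache_set(unsigned index, uint8_t value)
{
    if (index >= CFG_SIZE || ram_copy[index] == value)
        return; /* out of range or unchanged: nothing to write back */
    ram_copy[index] = value;
    flush_countdown = FLUSH_DELAY;
}

/* Call from a periodic timer; writes back once the countdown expires. */
void cache_tick(void)
{
    if (flush_countdown != 0u && --flush_countdown == 0u) {
        memcpy(nv_copy, ram_copy, CFG_SIZE);
        nv_write_count++;
    }
}
```

A burst of `cache_set` calls keeps restarting the countdown, so the whole burst costs a single non-volatile write once the data has been quiet for `FLUSH_DELAY` ticks.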

I see how using seq and ~seq instead of CRC could be advantageous in certain situations. But for Flash, it seems to me that the CRC would help in the case of a bad memory block? — schumacher574, Jan 11 '14 at 17:24
Unless of course you can detect a write failure due to a bad memory block at the time of writing, which would prevent the seq and ~seq from being written and prevent that data from being read on start-up. — schumacher574, Jan 11 '14 at 17:32
In my implementation, individual data items were separately validated; it was important that a single data item failure did not result in all data being discarded. The problem being solved here is specifically one of incomplete write detection rather than data corruption. If you want to combine whole-block data validation and incomplete write protection, then a CRC is the way to go. — Clifford, Jan 11 '14 at 18:07

DrRobotNinja · Answer 2 · score 6 · 2014-01-13T03:55:47.980

Our engineering team takes a two-pronged approach to this problem: solve it in hardware and software!

The first is a diode and capacitor arrangement to provide a few milliseconds of power during a brown-out. If we notice we've lost external power, we prevent the code from entering any non-volatile writes.

Second, our data is particularly critical for operation and it updates often, and we don't want to wear out our non-volatile flash storage (it only supports so many writes). So we actually store the data 16 times in flash and protect each record with a CRC code. On boot, we find the newest valid write and then start our erase/write cycles.

We've never seen data corruption since implementing our frankly paranoid system.

Update:

I should note that our flash is external to our CPU, so the CRC helps validate the data if there is a communication glitch between the CPU and flash chip. Furthermore, if we experience several glitches in a row, the multiple writes protect against data loss.
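One possible reading of the 16-record approach is a ring of CRC-protected slots, sketched below. The answer does not say how the newest record is identified, so the 32-bit sequence counter is an assumption, as are the slot layout, the CRC-16/CCITT-FALSE polynomial, and the names `find_newest` and `save`. `flash_sim` is a RAM array standing in for the external flash.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_COUNT   16u
#define PAYLOAD_SIZE 6u /* assumed payload size */

/* Hypothetical slot layout: counter, payload, CRC over both. */
typedef struct {
    uint32_t seq;
    uint8_t  payload[PAYLOAD_SIZE];
    uint16_t crc;
} slot_t;

slot_t flash_sim[SLOT_COUNT]; /* stand-in for the flash records */

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF. */
uint16_t crc16_ccitt(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFFu;
    while (n--) {
        crc ^= (uint16_t)((uint16_t)*p++ << 8);
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

static int slot_valid(const slot_t *s)
{
    return crc16_ccitt((const uint8_t *)s, offsetof(slot_t, crc)) == s->crc;
}

/* Boot-time scan: index of the valid slot with the highest counter, or -1. */
int find_newest(void)
{
    int best = -1;
    for (unsigned i = 0; i < SLOT_COUNT; i++) {
        if (slot_valid(&flash_sim[i]) &&
            (best < 0 || flash_sim[i].seq > flash_sim[best].seq))
            best = (int)i;
    }
    return best;
}

/* Write the next record into the slot after the newest one. */
void save(const uint8_t *payload)
{
    int cur = find_newest();
    unsigned next = (cur < 0) ? 0u : ((unsigned)cur + 1u) % SLOT_COUNT;
    slot_t s;

    s.seq = (cur < 0) ? 1u : flash_sim[cur].seq + 1u;
    memcpy(s.payload, payload, PAYLOAD_SIZE);
    s.crc = crc16_ccitt((const uint8_t *)&s, offsetof(slot_t, crc));
    flash_sim[next] = s; /* real flash: erase as needed, then program */
}
```

Rotating through the slots spreads erase/write cycles across the flash, and a record that fails its CRC is simply skipped in favour of the previous valid one.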

edited Jan 13 '14 at 03:55

answered Jan 11 '14 at 07:18


The multiple copies are important perhaps for Flash, but on a FRAM device only two copies are necessary since it has unlimited endurance. Even on Flash, a write-back cache may suffice depending on the update pattern and frequency. – Clifford Jan 11 '14 at 09:34
Good advice on the hardware side. Unfortunately, my current project is too far along to request a change, but this will definitely be something to consider for future projects. – schumacher574 Jan 11 '14 at 17:11
Sorry for my inexperience, but with your hardware setup do you receive an interrupt when external power is lost? – schumacher574 Jan 11 '14 at 17:36
@schumacher574: Looking specifically at the MSP430, depending on the variant you may have zero, one or both of "brown-out reset (BOR)", and "supply voltage supervisor (SVS)". The latter can generate a reset or interrupt, although whether you have sufficient time to do anything useful will depend on your power supply design and the current draw at the time of the loss of power. If you do not have an SVS an external supervisor can be used. See: http://www.ti.com/lit/ml/slap126/slap126.pdf – Clifford Jan 12 '14 at 08:57
@schumacher574, our hardware setup is for motor control and it requires us to monitor the main system voltage which is around 40 volts. If the system voltage drops to zero, we know we have a few milliseconds until our logic voltage browns out, so we essentially get the power loss event for free. – DrRobotNinja Jan 13 '14 at 03:52

score 5 · Answer 3 · answered Jan 12 '14 at 12:13

We've used something similar to Clifford's answer but written in one write operation. You need two copies of the data and alternate between them. Use an incrementing sequence number so that effectively one location has even sequence numbers and one has odd.

Write the data like this (in one write command if you can):

---------
|  seq  |
---------
|       |
| data  |
|       |
---------
| seq   |
---------

When you read it back make sure both the sequence numbers are the same - if they are not then the data is invalid. At startup read both locations and work out which one is more recent (taking into account the sequence number rolling over).
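A minimal sketch of this head/tail variant follows, assuming an 8-bit sequence number and a RAM array (`nv`) standing in for the two storage locations; `copy_t`, `newest_copy` and `save_copy` are names invented for the example. The parity check (even numbers in location 0, odd in location 1) is used here as an extra validity test.

```c
#include <stdint.h>
#include <string.h>

#define DATA_SIZE 4u /* assumed payload size */

/* Hypothetical layout matching the diagram: seq, data, seq again. */
typedef struct {
    uint8_t seq_head;
    uint8_t data[DATA_SIZE];
    uint8_t seq_tail;
} copy_t;

copy_t nv[2]; /* stand-in for the two non-volatile locations */

/* Valid if both sequence numbers match and the parity fits the location. */
static int copy_valid(unsigned idx)
{
    return nv[idx].seq_head == nv[idx].seq_tail &&
           (nv[idx].seq_head & 1u) == idx;
}

/* Rollover-aware test: is 8-bit sequence number a newer than b? */
static int seq_newer(uint8_t a, uint8_t b)
{
    uint8_t d = (uint8_t)(a - b);
    return d != 0u && d < 0x80u;
}

/* Index of the most recent valid copy, or -1 if neither is valid. */
int newest_copy(void)
{
    int ok0 = copy_valid(0u), ok1 = copy_valid(1u);
    if (ok0 && ok1) return seq_newer(nv[1].seq_head, nv[0].seq_head) ? 1 : 0;
    if (ok0) return 0;
    if (ok1) return 1;
    return -1;
}

/* Build the whole record in RAM, then write it in one operation. */
void save_copy(const uint8_t *data)
{
    int cur = newest_copy();
    uint8_t seq = (cur < 0) ? 0u : (uint8_t)(nv[cur].seq_head + 1u);
    copy_t c;

    c.seq_head = seq;
    memcpy(c.data, data, DATA_SIZE);
    c.seq_tail = seq;
    nv[seq & 1u] = c; /* even numbers go to location 0, odd to location 1 */
}
```

Note that in this sketch an all-zero location 0 reads as a valid record with sequence number 0, so a real design would need to decide how blank memory should be treated.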

score 0 · Answer 4 · answered Apr 10 '14 at 15:18

Always store data in some kind of protocol, like START_BYTE, total bytes to write, data, END_BYTE. Before writing to external or internal memory, always check the power monitor registers or ADC. If your data somehow gets corrupted, the END byte will also be corrupted, so that entry will not be valid after validation of the whole protocol. A simple checksum is not a good idea; you can choose CRC16 instead if you want to include a CRC in your protocol.
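A framed record along these lines could be sketched as follows. The marker values, the one-byte length field, the placement of the CRC before the END byte, and the function names are all assumptions made for illustration; the power-monitor check is hardware specific and is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define START_BYTE 0xA5u /* assumed marker values */
#define END_BYTE   0x5Au
#define MAX_DATA   32u   /* assumed maximum payload length */

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF. */
static uint16_t crc16_ccitt(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFFu;
    while (n--) {
        crc ^= (uint16_t)((uint16_t)*p++ << 8);
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Frame layout: START | len | data[len] | crc_hi | crc_lo | END.
 * Returns the frame size, or 0 if the data does not fit. */
size_t frame_encode(uint8_t *out, size_t cap, const uint8_t *data, uint8_t len)
{
    size_t total = (size_t)len + 5u;
    if (len > MAX_DATA || cap < total)
        return 0;
    out[0] = START_BYTE;
    out[1] = len;
    memcpy(&out[2], data, len);
    uint16_t crc = crc16_ccitt(&out[1], (size_t)len + 1u); /* len + data */
    out[2u + len] = (uint8_t)(crc >> 8);
    out[3u + len] = (uint8_t)(crc & 0xFFu);
    out[4u + len] = END_BYTE;
    return total;
}

/* Returns 1 and sets *data and *len if the stored frame is intact. */
int frame_decode(const uint8_t *in, size_t n, const uint8_t **data, uint8_t *len)
{
    if (n < 5u || in[0] != START_BYTE)
        return 0;
    uint8_t l = in[1];
    if (l > MAX_DATA || n < (size_t)l + 5u)
        return 0;
    if (in[4u + l] != END_BYTE)
        return 0;
    uint16_t crc = crc16_ccitt(&in[1], (size_t)l + 1u);
    if (in[2u + l] != (uint8_t)(crc >> 8) ||
        in[3u + l] != (uint8_t)(crc & 0xFFu))
        return 0;
    *data = &in[2];
    *len = l;
    return 1;
}
```

Because the END byte is the last thing written, a write cut short by power loss leaves a frame that fails the END check, and the CRC catches corruption in the length or data fields.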
