
A few landmark studies some years ago showed that silent corruption in large datasets is far more widespread than previously anticipated (and today, I suppose, still more widespread than commonly realized).

Assume that the application and OS wrote a sector and everything had time to flush, with no crash or abnormal shutdown, and no software bug that would cause wrong data to be saved.

Later, that sector is read back and the HDD reports no read error, yet it contains the wrong data.

Since HDD data encoding includes error-correction codes, I would assume that any mysterious change to a bit would generally be caught by the check. Even if the check is weak enough that some errors slip through, there should still be vastly more detected errors telling you something is wrong with the drive. But that doesn't happen: apparently, data is found to be wrong with no symptoms at all.

How can that happen?

My experience on a desktop PC is that files that were once good are sometimes later found to be bad, but perhaps that is due to unnoticed problems during writing, either in moving the sectors or in the file system's tracking of the data. The point is that errors may be introduced at write time, with data corrupted inside the HDD (or RAID hardware) so that the wrong data is written with matching error-correction codes. If that is the only cause, then a single verify pass should be enough to show that the write was OK.

Or does data go bad after it has been seen to be OK on the disk? That is, verify once and all is fine; verify later and an error is found, even though that sector has not been written in the interim. I think this is what is meant, since write-time errors would be easy to deal with through improved flush checking.

So how can that happen without tripping the error correction codes that go with the data?

JDługosz

3 Answers


Some ways silent data corruption could happen:

  • Corruption in memory before the data is written (in this case even filesystem-level checksums will not help you if the checksum is calculated after the corruption; see the sketch after this list)
  • Errors on the SATA cable that by chance match the checksum
  • Bit flips in the disk drive's cache memory (whether that is checksummed probably depends on make and model)
  • A bug in the drive firmware that corrupts the data before writing (with the checksum matching the corrupted data)
  • Corruption of the block on the disk platter that by chance matches the checksum
  • A read that returns corrupted data to the drive controller, where it by chance matches the checksum
  • Bugs in firmware that corrupt the data after verifying the checksum
  • Corruption in main memory after the data has been moved there
  • Bugs in the software that processes the data (although this is usually classified as an ordinary software bug rather than silent data corruption)
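
To make the first point on that list concrete, here is a minimal Python sketch (the buffer and the bit flip are invented, and zlib.crc32 merely stands in for whatever checksum a filesystem might use): a checksum computed after in-memory corruption faithfully covers the already-corrupted bytes, so verification passes.

    import zlib

    data = bytearray(b"important payload" * 4)

    # A bit flips in RAM *before* the filesystem computes its checksum...
    data[5] ^= 0x10

    # ...so the checksum is calculated over already-corrupted data.
    stored_checksum = zlib.crc32(data)

    # Later, the read path recomputes the checksum over the same bytes:
    assert zlib.crc32(data) == stored_checksum  # passes
    print("verified OK, yet the payload is wrong")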

Corruption that by chance matches its error code is by itself very unlikely, but the sheer amount of data plus the birthday paradox ensure that it does happen. Today's drives have internal read errors all the time and rely heavily on checksums to catch them. When that happens they simply re-read the sector until they get a good read, and if a sector becomes too bad they silently swap it for a spare sector. SATA controllers probably also silently re-send data if a checksum error occurs while data crosses the SATA cable.

The chance of a random corruption still matching the checksum can be made arbitrarily small by using a longer checksum, but that involves more storage and processing overhead. And in the case of standardised protocols such as SATA you can't just change the checksum size without breaking compatibility. And no protocol or disk level checksumming will save you from firmware bugs, or other software bugs for that matter.
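
As a rough back-of-the-envelope (the event count below is invented purely for illustration): if random corruption slips past a k-bit check with probability about 2^-k, then across n corrupted-read events you expect roughly n / 2^k silent errors, which is why widening the check helps so dramatically.

    # Illustrative only: expected silent errors ~ error_events * 2**-checksum_bits
    error_events = 10**12   # hypothetical corrupted reads across a large fleet
    for bits in (16, 32, 64):
        print(f"{bits}-bit check: ~{error_events / 2**bits:.2g} expected silent errors")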

The big advantage of filesystem level checksums like in ZFS/Btrfs is that they can catch all of these errors except main memory corruption (use ECC memory to protect against that) and software bugs. And they can use a larger checksum block size than a single disk block, to reduce the storage overhead of longer checksums.
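
The core of that idea fits in a few lines of Python (an in-memory toy, not the real ZFS/Btrfs on-disk format): the checksum is stored apart from the data it covers, so any mismatch between what was written and what comes back is caught no matter where along the path the corruption happened.

    import hashlib

    blocks = {}      # simulated data blocks, keyed by block number
    checksums = {}   # checksums kept separately from the data they cover

    def write_block(n, data):
        blocks[n] = data
        checksums[n] = hashlib.sha256(data).digest()

    def read_block(n):
        data = blocks[n]
        if hashlib.sha256(data).digest() != checksums[n]:
            raise IOError(f"block {n}: silent corruption detected")
        return data

    write_block(0, b"hello world")
    blocks[0] = b"hello w0rld"   # corruption anywhere below the filesystem
    try:
        read_block(0)
    except IOError as e:
        print(e)                 # caught instead of silently returned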

JanKanis

See http://en.wikipedia.org/wiki/Silent_data_corruption#Silent_data_corruption, which provides ample explanation. I would also like to mention the birthday paradox, which explains why the probability of an error is higher than intuitively expected. See http://en.wikipedia.org/wiki/Birthday_paradox.
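
For intuition on the birthday paradox (a quick generic sketch, not specific to disks): the probability that at least two of n random draws from d possible values coincide grows much faster than most people expect.

    def collision_probability(n, d):
        # P(at least one shared value among n draws from d possibilities)
        p_unique = 1.0
        for i in range(n):
            p_unique *= (d - i) / d
        return 1 - p_unique

    print(collision_probability(23, 365))        # ~0.507: the classic birthday result
    print(collision_probability(10_000, 2**32))  # small, but non-zero at scale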

Upon writing a sector, a CRC is calculated and written to disk along with the data. Upon reading, the data is read along with the CRC; the CRC is recalculated from the data read from the disk and compared with the CRC read from the disk.
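
In simplified Python terms (zlib.crc32 stands in for the drive's much stronger internal ECC, and a tuple stands in for the platter), that flow looks like this:

    import zlib

    def write_sector(data):
        # the drive stores a check code alongside the sector data
        return data, zlib.crc32(data)

    def read_sector(stored_data, stored_crc):
        # on read, the code is recomputed and compared with the stored one
        if zlib.crc32(stored_data) != stored_crc:
            raise IOError("read error reported to the host (or retried/corrected)")
        return stored_data  # silent if corrupted data happens to match the code

    data, crc = write_sector(b"sector payload")
    print(read_sector(data, crc))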

What could go wrong at the disk level but would be detected:

  • One or more data bits did not get written correctly.
  • One or more CRC bits did not get written correctly.
  • Both were written correctly but damaged later on.
  • Both were written correctly but the controller went bad or is buggy.

What could go wrong on the disk but would go undetected (a silent error):

  • Data or CRC is corrupted, either because it was badly written or upon reading due to a defective sector, yet (with low probability) the calculated CRC matches the CRC read from the device. That's where the birthday paradox comes into play (see the toy search below).
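
To see that such chance matches really exist, here is a toy search using a deliberately tiny 8-bit check (real drive ECC is far wider, so real collisions are correspondingly rarer; this only demonstrates the principle):

    import random
    import zlib

    def check8(data):
        return zlib.crc32(data) & 0xFF   # deliberately weak 8-bit check

    original = b"some sector contents"
    target = check8(original)

    rng = random.Random(0)
    for attempts in range(1, 100_001):
        corrupted = bytearray(original)
        # corrupt one random byte with a random non-zero XOR
        corrupted[rng.randrange(len(corrupted))] ^= rng.randrange(1, 256)
        if check8(bytes(corrupted)) == target:
            print(f"corrupted copy matched the 8-bit check after {attempts} tries")
            break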

Past the disk:

  • Data is read correctly from the disk by the controller but is transmitted incorrectly to memory over the SATA cable. I assume SATA has some type of error detection, but again, the same chance-match reasoning applies.
  • The data made it from the disk through the controller and across the SATA cable, but then a memory bit got flipped.

Tarik
  • That's the same link I mentioned in the Question. It's not an ample explanation, but says "it happens; some is due to memory/software in addition to the actual disk." I want to know what can happen to the disk that's not a simple write error but goes bad somehow. – JDługosz Feb 16 '15 at 07:43
  • Thanks for elaborating. Re the birthday paradox: assuming the CRC is short, it would mean matches happen, but non-matches are still far more common, and I'd expect thousands of reported read errors for each silent error. (I think the ECCs are far more elaborate than a short CRC on current drives, though.) – JDługosz Feb 17 '15 at 15:23
  • If a detectable CRC error occurs, the SATA controller will silently re-request or re-send the data, and a disk controller will re-read the sector until it has a good read. So unless you explicitly go looking for the number of retries/CRC failures/whatever, you won't notice those. Disk drives today have internal read errors all the time that are silently corrected. – JanKanis Nov 21 '17 at 12:44

One type of silent corruption is caused by the fact that many hard drives have a small write cache on the drive itself (separate from any cache on the disk controller and/or in the operating system).

In most cases this drive-internal cache is not power-loss safe, i.e., if someone pulls the plug during a burst of heavy writes, the data in the hard drive's cache is lost.

For example, on the high-end Dell database servers I've worked on, we had a high chance of database corruption (maybe 50% of the time) if someone accidentally pulled the plug while heavy write operations were under way. Our standard operating procedure now is to disable the hard drives' internal caches (disabling disk controller caching, or setting write-back caching at the disk controller level, didn't turn off the drives' internal caches for us).

See http://brad.livejournal.com/2116715.html for details and a testing tool to detect this.

Note that an operating system crash doesn't cause this kind of corruption; only a power failure does.

Ben Slade
  • How do you turn off hard drive caching? – Tarik Jul 23 '22 at 02:43
  • In my environment we use the MegaCli utility ( https://www.broadcom.com/support/knowledgebase/1211161498596/megacli-cheat-sheet--live-examples ). There's also PercCLI ( https://www.dell.com/support/kbdoc/en-us/000177280/how-to-use-the-poweredge-raid-controller-perc-command-line-interface-cli-utility-to-manage-your-raid-controller ). Sorry for the ugly formatting; no other choice for StackOverflow comments. – Ben Slade Aug 26 '22 at 14:27
  • Also, for newer enterprise SSDs with internal cache, many of them have a capacitor that gives them enough power/time to flush the internal cache to the non-volatile flash memory when power is lost. See https://www.micron.com/-/media/client/global/documents/products/white-paper/ssd_power_loss_protection_white_paper_lo.pdf – Ben Slade Aug 26 '22 at 14:37