
Most affordable desktop x86 platforms still have no ECC (Error Checking & Correction) memory support. But the rate of memory bit-flip errors is still growing (not the best SO thread; the large-scale CERN 2007 study "Data integrity": "Bit Error Rate of 10^-12 for their memory modules ... observed error rate is 4 orders of magnitude lower than expected"; Google's 2009 "DRAM Errors in the Wild: A Large-Scale Field Study"). For current hardware under a data-intensive load (8 GB/s of reads), this means that a single bit flip may occur every minute (the vendors' 10^-12 BER from CERN07) or once in two days (the observed 10^-16 BER from CERN07). Google09 says that there can be up to 25000-75000 one-bit FIT (failures in time per billion hours) per Mbit, which is equal to 1-5 bit errors per hour for 8 GB of RAM ("mean correctable error rates of 2000–6000 per GB per year").
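The estimates above can be rechecked with a few lines of arithmetic. This is only a sketch under stated assumptions that are not taken from the cited studies: "8 GB/s" is read as 8e9 bytes per second, "8 GB of RAM" as 8 GiB, and 1 Mbit as 2^20 bits. The function names are made up for illustration.

```python
# Back-of-the-envelope check of the bit-flip rates quoted in the question.
# Assumptions (mine, not from CERN07/Google09): 8 GB/s = 8e9 bytes/s,
# 8 GB of RAM = 8 * 2**30 bytes, 1 Mbit = 2**20 bits.

BITS_PER_BYTE = 8


def seconds_between_errors(bytes_per_second, bit_error_rate):
    """Mean time between bit errors for a given read bandwidth and BER."""
    bits_per_second = bytes_per_second * BITS_PER_BYTE
    return 1.0 / (bits_per_second * bit_error_rate)


def errors_per_hour_from_fit(fit_per_mbit, ram_bytes):
    """FIT = failures per 1e9 device-hours; scale it to the whole RAM."""
    mbits = ram_bytes * BITS_PER_BYTE / 2**20
    return fit_per_mbit * mbits / 1e9


def errors_per_hour_from_yearly(errors_per_gb_per_year, ram_gb):
    """Convert 'errors per GB per year' into errors per hour."""
    hours_per_year = 365.25 * 24
    return errors_per_gb_per_year * ram_gb / hours_per_year


if __name__ == "__main__":
    print(seconds_between_errors(8e9, 1e-12))           # seconds, vendor BER
    print(seconds_between_errors(8e9, 1e-16) / 86400)   # days, observed BER
    print(errors_per_hour_from_fit(25000, 8 * 2**30))
    print(errors_per_hour_from_fit(75000, 8 * 2**30))
    print(errors_per_hour_from_yearly(2000, 8))
    print(errors_per_hour_from_yearly(6000, 8))
```

Under these assumptions the 10^-12 BER works out to one flip roughly every 16 seconds (the same order of magnitude as "every minute"), the 10^-16 BER to about 1.8 days, and the FIT range to about 1.6-4.9 errors per hour.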

So I want to know: is it possible to add some kind of system-wide software error detection (checking both user and kernel memory)? For example, could one patch the Linux kernel and/or the system compiler to checksum every memory page, and try to detect silent memory corruption (bit flips) by regularly recomputing the checksums?

For example, can we see all writes to memory (from both user and kernel space), to distinguish intended memory changes from in-memory bit flips? Or can we somehow instrument all code with some helper?

I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit flips early, before they are reused in later computations or stored to the hard drive.

I also understand that the better way to protect data from memory bit flips is to switch to ECC hardware, but most PCs out there are still non-ECC.

osgx
  • Some links: 2005 ["**SoftECC**: A System for Software Memory Integrity Checking" by Dave Dopson](http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf), **Libsdc** 2011 ["A Tunable, Software-based DRAM Error Detection and Correction Library for HPC"](http://www.fiala.me/pubs/papers/libsdc11.pdf), **RedMPI** 2012 ["Detection and correction of silent data corruption for large-scale high-performance computing"](http://www.fiala.me/pubs/papers/sc12-redmpi.pdf) (http://redmpi.com/) and other papers from David Fiala from NCSU (http://www.fiala.me/) – osgx May 11 '14 at 03:06
  • Also, Derek Jones, The Shape of Code blogposts: [Compiling to reduce the impact of soft errors on program output (2011)](http://shape-of-code.coding-guidelines.com/2011/11/07/compiling-to-reduce-the-impact-of-soft-errors-on-program-output/), and [Source code will soon need to be radiation hardened (2014)](http://shape-of-code.coding-guidelines.com/2014/05/29/source-code-will-soon-need-to-be-radiation-hardened/) – osgx May 29 '14 at 22:40
  • Yes, there are many software solutions, see this paper: [A Survey of Techniques for Modeling and Improving Reliability of Computing Systems](https://www.academia.edu/12046032/A_Survey_of_Techniques_for_Modeling_and_Improving_Reliability_of_Computing_Systems) – user984260 Jul 12 '15 at 16:19

2 Answers


The thing is, ECC is dirt cheap compared to "software ECC countermeasures". You can easily detect if they have ECC modules and complain (or print a warning) when they don't.

http://www.cyberciti.biz/faq/ecc-memory-modules/
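On Linux, one way to print such a warning is to look at the EDAC counters in sysfs. This is a rough sketch, assuming the kernel's EDAC driver for the memory controller is loaded; a missing directory can also just mean the driver is absent, so treat an empty result as "unknown", not as proof of non-ECC RAM. The function name and the `edac_root` parameter are hypothetical.

```python
from pathlib import Path


def read_edac_counters(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected_errors, uncorrected_errors)}.

    An empty dict means no EDAC memory controller was found, which usually
    means either no ECC support or no EDAC driver loaded (assumption).
    """
    root = Path(edac_root)
    counters = {}
    if not root.is_dir():
        return counters
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # controller directory without readable counters
        counters[mc.name] = (ce, ue)
    return counters


if __name__ == "__main__":
    found = read_edac_counters()
    if not found:
        print("warning: no EDAC memory controller visible; ECC may be absent")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```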

> For example, can we see all writes to memory (from both user and kernel space), to distinguish intended memory changes from in-memory bit flips? Or can we somehow instrument all code with some helper?

Er, you will never "see" the bit flips on the bus. They are literally caused by a particle hitting RAM and flipping a bit. Only much later can you notice that you read out something different from what you wrote. To detect this only via the bus, you would need a duplicate copy of all your RAM (i.e. a shadow copy of what is in your real RAM, so you can verify that every read returns what was written to that location).
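The shadow-copy idea can be illustrated in user space with a toy buffer that mirrors every write and compares both copies on every read. This is a sketch only; `ShadowedMemory` and `BitFlipDetected` are made-up names, and a real system-wide version would have to sit below every load and store, which ordinary software cannot do cheaply.

```python
class BitFlipDetected(Exception):
    """Raised when the primary and shadow copies disagree."""


class ShadowedMemory:
    """Toy buffer that keeps a second copy of everything written to it."""

    def __init__(self, size):
        self.primary = bytearray(size)
        self.shadow = bytearray(size)

    def _check_range(self, offset, length):
        if offset < 0 or offset + length > len(self.primary):
            raise ValueError("access outside of buffer")

    def write(self, offset, data):
        self._check_range(offset, len(data))
        # Every intended write goes to both copies.
        self.primary[offset:offset + len(data)] = data
        self.shadow[offset:offset + len(data)] = data

    def read(self, offset, length):
        self._check_range(offset, length)
        a = bytes(self.primary[offset:offset + length])
        b = bytes(self.shadow[offset:offset + length])
        if a != b:
            # Something changed one copy without going through write().
            raise BitFlipDetected(f"mismatch in [{offset}, {offset + length})")
        return a


if __name__ == "__main__":
    mem = ShadowedMemory(16)
    mem.write(0, b"hello")
    mem.primary[1] ^= 0x04  # simulate a bit flip in one copy
    try:
        mem.read(0, 5)
    except BitFlipDetected as exc:
        print("detected:", exc)
```

With two copies a mismatch can be detected but not corrected, because there is no way to tell which copy is the damaged one.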

> try to detect silent memory corruption (bit flips) by regularly recomputing the checksums?

The Redis guy has a nice write-up on an algorithm for testing RAM for problems. http://antirez.com/news/43 But this is really looking for RAM errors, not random bit-flips.

Recomputing checksums only works while you are NOT writing to the memory. That might be "good enough", but you'll need to figure out which pages are not being written to.
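A periodic scrub over pages that are known not to be written might look roughly like this. It is a user-space sketch on a `bytearray`; the set of "written" pages is simply passed in by the caller, whereas a kernel would presumably have to derive it from page-table dirty bits or write protection (my assumption, not something stated above). CRC32 is used only because it is in the standard library.

```python
import zlib

PAGE_SIZE = 4096  # illustrative page size


def page_checksums(memory, page_size=PAGE_SIZE):
    """CRC32 of every page-sized slice of a bytes-like object."""
    return [zlib.crc32(memory[i:i + page_size])
            for i in range(0, len(memory), page_size)]


def scrub(memory, recorded, written_pages=frozenset(), page_size=PAGE_SIZE):
    """Return indexes of pages whose checksum changed although nobody
    reported writing to them. Pages in written_pages are skipped, because
    a changed checksum there is expected."""
    current = page_checksums(memory, page_size)
    return [i for i, (old, new) in enumerate(zip(recorded, current))
            if old != new and i not in written_pages]


if __name__ == "__main__":
    ram = bytearray(4 * PAGE_SIZE)
    baseline = page_checksums(ram)
    ram[2 * PAGE_SIZE + 10] ^= 0x01   # simulated bit flip in page 2
    print(scrub(ram, baseline))       # suspicious pages
```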

To catch 100% of the errors, every write must be preceded by computing the checksum of that block of memory and comparing it to the recorded checksum (to make sure the block hasn't degraded in RAM). Only then is it safe to do the write and update the checksum. As you can imagine, the performance of this will be horrible (at least 100x slower).
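The verify-then-write-then-update sequence can be sketched as a small wrapper around one block. `ChecksummedBlock` is a hypothetical name, and CRC32 stands in for whatever checksum would really be used; note that every single write rescans the whole block twice, which is where the cost comes from.

```python
import zlib


class ChecksummedBlock:
    """One block of memory plus the checksum recorded after the last write."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.checksum = zlib.crc32(self.data)

    def verify(self):
        # Recompute and compare: a mismatch means the block changed
        # without going through write().
        if zlib.crc32(self.data) != self.checksum:
            raise RuntimeError("block changed without a recorded write")

    def write(self, offset, payload):
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write outside of block")
        self.verify()                                  # 1. check old contents
        self.data[offset:offset + len(payload)] = payload  # 2. do the write
        self.checksum = zlib.crc32(self.data)          # 3. record new checksum


if __name__ == "__main__":
    block = ChecksummedBlock(64)
    block.write(0, b"abc")
    block.data[5] ^= 0x10          # simulated bit flip
    try:
        block.write(8, b"xyz")
    except RuntimeError as exc:
        print("detected before write:", exc)
```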

> I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit flips early, before they are reused in later computations or stored to the hard drive.

Well, there is a simple method to detect 100% of the errors, at a cost of 50% performance: Just run the computation on 2 boxes at once (or on one box at two different times, maybe with a RAM test in between if you are paranoid.) If the results differ, you have detected an error.
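As a sketch of the "run it twice and compare" approach: the helper below runs a computation several times and refuses to return a result unless all runs agree. `run_redundantly` and `ResultMismatch` are made-up names, and here both runs happen in one process on one machine; in the two-box setup each run would happen on a different host and only the results (or their hashes) would be compared.

```python
class ResultMismatch(Exception):
    """Raised when redundant runs of the same computation disagree."""


def run_redundantly(compute, *args, runs=2):
    """Run compute(*args) several times and return the result only if
    every run produced the same value."""
    results = [compute(*args) for _ in range(runs)]
    first = results[0]
    if any(r != first for r in results[1:]):
        raise ResultMismatch(f"runs disagree: {results!r}")
    return first


if __name__ == "__main__":
    total = run_redundantly(sum, range(1_000_000))
    print("agreed result:", total)
```

This only works for deterministic computations; anything that depends on time, randomness, or thread scheduling will produce false alarms.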

See also:

https://www.linuxquestions.org/questions/linux-hardware-18/how-to-detect-ecc-memory-errors-under-linux-886011/

BraveNewCurrency
  • Brave, yes, ECC memory DIMMs are cheap (only 1/8 costlier in chip cost), but the hardware platform that can use ECC correction/detection is not cheap. For example, Intel disables ECC support in all memory controllers in desktop CPUs, like Core2, i3/i5/i7. Only Xeons have the ECC circuits enabled. Not sure about AMD. PS: thank you for the idea of memory duplication (-30-40% performance, +100% memory overhead). – osgx Jun 18 '14 at 17:04
  • The platforms that support ECC are not cheap, and often not all platforms are supported. If you would like to avoid the Intel x86 / i64 platform, e.g. by using an ARM SoC, the chance of finding one with ECC memory is nil – humanityANDpeace Sep 08 '15 at 07:04
  • I don't think the OP was really asking about checksums, but an error correcting code like a [Hamming Code](https://en.wikipedia.org/wiki/Hamming_code). By using a few extra bits you can determine whether a given word is valid or not. Some codes can only detect a single bit error, others can detect and correct multiple bit errors. Implementing this in software would be awkward though, as you'd need to start with non-standard word lengths to fit into the 32/64 bit words of most computers after you've added the parity bits. – sandyscott Sep 22 '17 at 13:47
  • @osgx AMD supports ECC on pretty much everything. It is a question of motherboard support, mainly. However, I suspect that, instead of adding a hack onto your code that has a good chance of tanking your performance, it would probably be a good idea to simply buy previous-generation server hardware. Six years down the line, but hope it helps. LOL – Tripp Kinetics Nov 15 '20 at 02:06
  • Reversing the question: are there countermeasures to minimize the likelihood of a bit flip in a non-ECC system? E.g., running the system cool, shielding from ionizing radiation, etc.? – Pa_ Jul 16 '21 at 14:48
  • @Pa_ “The materials suitable for cosmic-ray shield design are materials such as lead and iron that will stop the primary protons, and materials like polyethylene, borated polyethylene, concrete and water that will stop the induced neutrons.” (from “[Cosmic Ray Interactions in Shielding Materials](https://www.pnnl.gov/main/publications/external/technical_reports/PNNL-20693.pdf)”) – Christian - Reinstate Monica C Dec 04 '21 at 00:54

The answer to the question is yes, and the proof is SoftECC, the software linked in the comments!

Just a note that SoftECC is a kernel-level solution. If a user-land app is used on top of it, that becomes a third stage of redundancy, which does not seem necessary.

vitorafsr
  • Why "third"? SoftECC is first, user-space (libsdc) is second only if used with SoftECC at the same time. And I'm asking about hardware without hardware ECC. – osgx Jun 06 '14 at 11:12
  • It seems you are asking about a software error detection. – vitorafsr Jun 07 '14 at 02:27