I just got started writing fail-safe, high-integrity C code and I'd like to know if programs can "fix themselves" if a variable gets corrupted for whatever reason (for example, cosmic rays). I know that there is specific hardware like ECC RAM that can counter this, but assuming that the hardware I will be using doesn't have error correction, are there any ways a program can check itself for errors and fix itself? I know I could log every variable change somewhere and check every variable before use to see whether it has been changed somehow, but that would slow a program down by a large margin due to I/O speeds. Are there any other ways for a program to check and possibly fix itself?
- In a certain commercial data logging operation, the design brief was to provide two loggers running in parallel, and then the results would be monitored to ensure they agree. This would not be self-fixing, but arguably having three loggers could be. – Weather Vane Aug 09 '22 at 11:44
- Is the hardware expensive? You could have three computers running the same program in parallel and sync them now and then. – klutt Aug 09 '22 at 11:53
- @klutt No, the hardware is not expensive. So should I just run everything in parallel on two different machines and check if the output is always the same? What exactly do you mean by "sync them"? – Giancarlo Metitieri Aug 09 '22 at 11:58
- If you're working for a company that makes safety-critical stuff, they should know how to do this. Ask your boss. – user253751 Aug 09 '22 at 12:22
- A lot of academic research has gone, and still is going, into this issue. Any answer _here_ can give you only one possible solution, and would be just a kind of opinion. You need to read a great deal, write down your requirements, ask the seniors on the team, experiment, and so on. – the busybee Aug 09 '22 at 12:49
- In a safety-critical system you must define a safe state for your system. When you detect an error, you have to switch the system into this state and not try to compensate for the error. If you want to keep the system running, you need three different machines; then you can trust the result if two of them agree. – Mike Aug 10 '22 at 04:58
6 Answers
You will need to perform some mathematical operation on the specific parts of your memory where your critical variables live, e.g. a CRC or hash (@klutt already mentioned that). You could also create a wrapper around each variable that stores it redundantly (twice or more) and checks for changes when reading it. These will not catch systematic errors (for example, bit position 7 being defective on the bus), but they are probably very easy to implement. Communication protocols use many different error-detection approaches, such as checksums, which can be implemented fairly easily.
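For illustration, here is a minimal sketch of the redundant-storage idea in C: each critical value is kept twice, once plain and once bit-inverted, so a flipped bit makes the copies inconsistent. The type and function names are just placeholders, not a standard API.

```c
#include <stdint.h>
#include <stdbool.h>

/* A critical value stored twice: plain and bit-inverted. */
typedef struct {
    uint32_t value;
    uint32_t value_inv; /* always kept as ~value */
} safe_u32;

static void safe_u32_write(safe_u32 *v, uint32_t x)
{
    v->value = x;
    v->value_inv = ~x;
}

/* Returns true and stores the value in *out if both copies agree;
 * returns false if corruption was detected and the caller must react. */
static bool safe_u32_read(const safe_u32 *v, uint32_t *out)
{
    if ((v->value ^ v->value_inv) != 0xFFFFFFFFu)
        return false;
    *out = v->value;
    return true;
}
```

Note that two copies only let you detect corruption, not repair it; to actually fix a value you would need at least three copies (majority vote) or a copy plus a checksum.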
A different possibility is redundant hardware (as @klutt also mentioned). This is standard in today's safety applications, depending on the required SIL (i.e. whether people can die or not).
Software checking for its own errors is not intuitive, and every C compiler you have will work against you. Making all the redundant variables volatile will also have the unwanted effect of making your program much slower.
There are lots of hard-to-implement possibilities in this answer; maybe someone knows a go-to software solution? I don't think there is one, though...

"I know I could log every variable change somewhere and check every variable before usage..." - you "know"? How would that work? How would you know that the software doing the logging and checking has not itself been affected? It is not at all practical.
Critical data (especially persistent/non-volatile data) might employ redundancy and error detection/correction, but for whole-system integrity you would do better to monitor for correct operation. Spontaneous data corruption, whether through external interference or software error, will often result in incorrect operation. By using software and hardware watchdogs, you can detect many of these faults and take corrective action - often you will be able to do little more than issue a reset.
Software watchdogs are applicable only in multi-threaded/multi-tasking systems, and in some cases you might be able to restart a thread - but that requires a rather sophisticated software architecture to pull off while still trusting the integrity of the system.
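As a concrete illustration of the watchdog pattern, here is a bare-metal sketch: each task checks in once per cycle, and the main loop only services the hardware watchdog when every task has done so. The WDT_KICK register address and key are hypothetical; the real mechanism is defined by your MCU's reference manual.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical watchdog register -- substitute your MCU's actual one. */
#define WDT_KICK     (*(volatile uint32_t *)0x40001000u)
#define WDT_KICK_KEY 0xA5A5A5A5u

static volatile bool sensor_task_alive;
static volatile bool control_task_alive;

/* Called periodically from the main loop. The hardware watchdog is only
 * serviced if every supervised task has checked in since the last call;
 * otherwise it is allowed to expire and reset the system. */
void watchdog_service(void)
{
    if (sensor_task_alive && control_task_alive) {
        sensor_task_alive  = false;
        control_task_alive = false;
        WDT_KICK = WDT_KICK_KEY; /* pet the hardware watchdog */
    }
    /* else: deliberately let the watchdog time out */
}
```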

As for your original question of whether the program can fix itself: I think the answer should be NO, because once a variable is corrupted, the functioning of the program can no longer be trusted.
A safety implementation includes mechanisms in both hardware and software, based on an analysis of the possible failures and their impact. Software alone is therefore not sufficient as a safety measure/reaction.
In the general case, there are some mechanisms to prevent or recover from this kind of error:
- ECC: this is a hardware mechanism; however, it can typically detect/correct single-bit errors only. Recovery from errors affecting more than one bit is not possible.
- Use a CRC to detect errors in critical data, plus a redundant copy (doubling the memory) to recover (see the sketch after this list). However, this approach can only protect specific data, under the assumption that the rest of the system still functions properly, and the normal reaction/recovery plan is a shutdown/reset.
- For more critical systems, adding extra hardware for cross-checking is also a good choice.
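A minimal sketch of the CRC-plus-redundant-copy idea from the second bullet, assuming bare-metal C; the bitwise CRC-32 and the critical_t layout are illustrative, not a prescribed format.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Simple bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return ~crc;
}

/* Critical data kept in two copies, each protected by its own CRC. */
typedef struct {
    int32_t  setpoint;
    uint32_t crc;
} critical_t;

static critical_t primary, backup;

void critical_write(int32_t x)
{
    primary.setpoint = x;
    primary.crc = crc32(&primary.setpoint, sizeof primary.setpoint);
    backup = primary;
}

/* Returns 0 on success, repairing a corrupted copy from the good one.
 * If both copies fail their CRC, the only safe reaction is shutdown/reset. */
int critical_read(int32_t *out)
{
    bool p_ok = crc32(&primary.setpoint, sizeof primary.setpoint) == primary.crc;
    bool b_ok = crc32(&backup.setpoint,  sizeof backup.setpoint)  == backup.crc;

    if (p_ok && !b_ok) backup = primary;  /* repair backup */
    if (b_ok && !p_ok) primary = backup;  /* repair primary */
    if (!p_ok && !b_ok) return -1;        /* unrecoverable: go to safe state */

    *out = primary.setpoint;
    return 0;
}
```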

As a rule of thumb, the guidelines in the applicable functional safety standard should be followed (IEC 61508, ISO 26262, etc.; "SIL"/"ASIL"). These recommend the use of ECC, and nowadays there are safety MCUs which come with hardware ECC and claim compliance with such functional safety standards. Such MCUs will typically throw a hardware exception upon errors, and from there on there's not much you can do but put the program in a safe mode if possible, or otherwise reset.
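The reaction to such an exception might look roughly like the following sketch; the handler name and outputs_disable_all() are hypothetical stand-ins for whatever vector and output-shutoff mechanism the specific safety MCU provides.

```c
/* Hypothetical ECC-fault exception handler -- the real vector name and
 * safe-state mechanism are defined by the MCU and the safety concept. */
extern void outputs_disable_all(void); /* drive all actuators to safe levels */

void ECC_Fault_Handler(void)
{
    outputs_disable_all(); /* first priority: force the safe state */

    for (;;) {
        /* Stay here until the watchdog resets the system, or trigger
         * a system reset explicitly if the safety concept calls for it. */
    }
}
```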
I wouldn't really recommend the more "old school" ways such as software ECC or CRC, "walking bit" tests, etc., since these are cumbersome to implement and get right, adding extra software complexity and thereby risk. There's really no valid reason why you shouldn't be using a safety MCU.
If I compare older safety-related software I wrote in the mid-2000s, which implements CRC, walking-bit tests of RAM, redundancy in duplicate memory segments and so on, it's so much more complex than the programs I've written for a safety MCU, where the hardware handles pretty much everything.
Implementing complex safety mechanisms, like software complexity in general, is a hazard! It has been proven over and over again that the number of bugs in a program correlates with its complexity - always follow the "keep it simple, stupid" (KISS) principle.

If you're running on an Arduino or something like that, I'd suggest having two identical pieces of hardware running the same program. Then you can check that they produce the same result. Maybe even periodically compare the whole memory to see that it is identical.
Of course, this could also be done with virtual machines if your hardware is capable enough.
If it's critical that the program keeps running, use three machines and take the result that is produced by at least two of them. That's what they did on the Saturn V.
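For what it's worth, the 2-out-of-3 majority vote over the three results can be written very compactly in C; this is just a sketch of the voting step, not of the inter-machine communication.

```c
#include <stdint.h>

/* Bitwise 2-out-of-3 majority: each output bit takes whatever value at
 * least two of the three inputs agree on, outvoting a single fault. */
static uint32_t vote2of3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Identify the dissenting input (if any) so the faulty unit can be
 * logged or taken offline: returns 0, 1 or 2 for the outlier, -1 if
 * all agree, or 3 if all three differ and there is no majority. */
static int disagreeing_input(uint32_t a, uint32_t b, uint32_t c)
{
    if (a == b && b == c) return -1;
    if (a == b) return 2;
    if (a == c) return 1;
    if (b == c) return 0;
    return 3;
}
```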

- How is it possible to compare the memory of two hardware devices? I don't know any C function that can "output the whole memory" to check it. Would I need a "controller device" for the three of the same hardware? Also, should I output everything digitally, compare it with an AND gate and then turn it into an analog signal (that's what I need), or are there other ways to compare outputs? – Giancarlo Metitieri Aug 09 '22 at 12:15
- On a bare-metal machine, I guess you could create a hash sum of the whole memory. Or possibly only the memory the program is using. – klutt Aug 09 '22 at 12:21
- Writing safety-related programs using an Arduino would likely be outright criminal. – Lundin Aug 15 '22 at 10:21
- Also, space electronics is quite unique since it is often very much mission-critical but often not so much safety-critical. A majority vote function might make perfect sense to keep a system running no matter what, while at the same time it can be considered critical in a safety-related system. Not to be confused with _redundancy_ = having two different MCUs coming up with the same result and errors are _not_ accepted; that's a common way to implement safety-critical systems. – Lundin Aug 15 '22 at 10:25
If you have to meet a certain safety level, then you need from the start a so-called "Technical Safety Concept" and a Hazard Analysis / System FMEA. This concept will yield Technical Safety Requirements, which might also be HW requirements, e.g. a processor and external peripherals with safety features. Usually, the SoC/chip vendor will then also provide a Safety Manual, which states what the processor/SoC supports and what it does not, and which safety measures you have to supply yourself (e.g. two different calculations, ALU/FPU checks, BIST, dual clock compare (reference clock against the clock source), ...).
That's why this safety-critical stuff costs more than non-safety-critical stuff (beginning with the HW, up to additional tests and source code reviews). It sounds like in your case no such analysis was done, and someone has chosen the wrong HW. If you don't have HW support, you have to cope with the additional resource/runtime usage.
You might have to do:
- cyclic ALU/FPU checks (see the sketch after this list)
- memory checks (CRC): if there is no ECC, you have to do these yourself and cope with the additional runtime and memory usage
- NvM data might need a CRC stored alongside it, checked before the data is used
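As one example, a cyclic ALU check can be sketched as below: feed the ALU operands it cannot constant-fold at compile time and compare against known results. The exact set of operations to check would come from the safety manual; this list is illustrative only.

```c
#include <stdint.h>
#include <stdbool.h>

/* volatile so the compiler cannot compute the results at compile time;
 * the ALU must actually perform the operations at run time. */
static volatile uint32_t op_a = 0xAAAAAAAAu;
static volatile uint32_t op_b = 0x55555555u;

/* Run a handful of operations with known results. Call this cyclically;
 * on failure, the caller should switch the system into its safe state. */
bool alu_self_test(void)
{
    if ((op_a + op_b) != 0xFFFFFFFFu) return false;
    if ((op_a & op_b) != 0x00000000u) return false;
    if ((op_a ^ op_b) != 0xFFFFFFFFu) return false;
    if ((op_a >> 1)   != 0x55555555u) return false;
    return true;
}
```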
For safety-critical systems, you have to apply "state-of-the-art" techniques; otherwise you can be held responsible and sued in case hazards happen. I had such a product safety and reliability training about 20 years ago, and came out of it thinking I already had one foot in jail.
