
How do I test that my program is robust to unexpected shut-downs?

My Python code will run on a microcontroller that can shut off unexpectedly. I would like to test that each part of the code handles an unexpected reboot correctly.

Attempt: I tried putting the code into its own process and terminating it early, but this doesn't work: MyClass calls 7zip from the command line, and that child process keeps running even after the parent process dies:

import multiprocessing
import os
import time


class MyClass(multiprocessing.Process):
   ...
   def run(self):
      os.system("7z a myfile.7z myfile")


process = MyClass()
process.start()
time.sleep(4)
print("terminating early")
process.terminate()
print("done")

What I want:

class TestMyClass(unittest.TestCase):
    def test_MyClass_continuity(self):
        myclass = MyClass().start()
        myclass.kill_everything()
        myclass = MyClass().start()
        self.assert_everything_worked_as_expected()

Is there an easy way to do this? If not, how do you design robust code that could terminate at any point (e.g. testing state machines)?

Similar question (unanswered as of 26/10/21): Simulating abnormal termination in pytest

Thanks a lot!

  • Your first code example has some syntax errors, but apart from that, don't put code you want to run in a subprocess in the __init__, that's running in the main process. Although the correct behavior should be for it to block, not background, are you sure it ran in the background? – Jason S Nov 03 '21 at 03:57
  • Thanks, you're right, updated now - code was meant to be running in its own process (defined in the `run` function), not in `__init__`. – benjamin deworsop Nov 03 '21 at 07:01
  • Could [process groups](https://stackoverflow.com/a/322317/1016216) help here? – L3viathan Nov 05 '21 at 21:07
  • microcontroller or microprocessor like raspberry? – bitbang Nov 06 '21 at 21:59
  • idk about pytest, but the approach I would take to this is take a look at every system call (primarily open files / attached hardware) and make sure you handle every type of error it could return. Directory doesn't exist: create it. Serial device not attached: wait for attach event. etc... – Aaron Nov 06 '21 at 23:16

2 Answers


Your code starts a process wrapped in the MyClass object, which itself spawns another process via the os.system call.

When you terminate the MyClass process, you kill the parent but leave the 7zip process running as an orphan.

Moreover, the process.terminate method sends a SIGTERM signal to the child process. The child process can intercept said signal and perform some cleanup routines before terminating. This is not ideal if you want to simulate a situation where there is no chance to clean up (a power loss). You most likely want to send a SIGKILL signal instead (on Linux).

To kill the parent and child process, you need to address the entire process group.

import os
import time
import signal
import multiprocessing


class MyClass(multiprocessing.Process):
    def run(self):
        # Detach into a new session (and process group) so that
        # killing the group does not take the main script down too
        os.setsid()

        # Ping localhost for a limited amount of time
        os.system("ping -c 12 127.0.0.1")


process = MyClass()
process.start()

time.sleep(4)

print("terminating early")

# Send SIGKILL to the entire process group (MyClass and the ping it spawned)
group_id = os.getpgid(process.pid)
os.killpg(group_id, signal.SIGKILL)

print("done")

The above works only on Unix-like OSes, not on Windows.

On Windows, you can use the psutil module instead.

import os
import time
import multiprocessing

import psutil


class MyClass(multiprocessing.Process):
    def run(self):
        # Ping localhost a limited number of times (-n on Windows)
        os.system("ping -n 12 127.0.0.1")


def kill_process_group(pid):
    process = psutil.Process(pid)
    children = process.children(recursive=True)

    # First kill all the children
    for child in children:
        child.kill()
    psutil.wait_procs(children)

    # Then kill the parent process
    process.kill()
    process.wait()


process = MyClass()
process.start()

time.sleep(4)

print("terminating early")

kill_process_group(process.pid)

print("done")
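Tying this back to the test skeleton in the question, one way to structure the kill-then-restart check is sketched below. `write_state` is a stand-in for MyClass.run (not part of the question's code): it persists a counter using write-temp-then-rename so that a reader always finds a complete value.

```python
import multiprocessing
import os
import signal
import tempfile
import time
import unittest


def write_state(path):
    # Stand-in worker: repeatedly persists a counter by writing a
    # temp file and renaming it into place (atomic on POSIX)
    for i in range(1000):
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(i))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)


class TestCrashConsistency(unittest.TestCase):
    def test_state_survives_sigkill(self):
        path = os.path.join(tempfile.mkdtemp(), "state")
        worker = multiprocessing.Process(target=write_state, args=(path,))
        worker.start()
        time.sleep(0.2)
        os.kill(worker.pid, signal.SIGKILL)  # abrupt stop, no cleanup
        worker.join()

        # "Reboot": a fresh run must find a complete value on disk,
        # never a torn write
        with open(path) as f:
            self.assertGreaterEqual(int(f.read()), 0)
```

The same pattern extends to the real MyClass: start it, kill the whole process group as shown above, start it again, and assert on the persisted state.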
noxdafox

I think this is a question of data persistence and consistency. You need to make sure all data that is persistent (i.e. written to disk) is consistent, too.

Imagine some sort of data written to a status file. What will be read by the application after an unexpected termination? Half of the new status and half of the previous one? Half of the new status and the rest all 0x00?

So the answer to your question "How do you design robust code that could terminate at any point?" is: use atomic operations when working with persistent data. Most databases give some guarantees in that direction. For local files, I personally rely on renaming. I write to a temporary file without worrying about consistency at all; only when the write is done (be sure to flush the buffers!) and therefore consistent do I use the atomic rename operation to make the temporary file the new single point of truth. If the application terminates unexpectedly at any point in the process, the persistent data will always be consistent: it will be either the previous state (plus some garbage in a temporary file) or the new state, but nothing in between.
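The rename trick can be sketched like this (`save_state_atomically` is an illustrative name; os.replace is atomic on POSIX when both paths are on the same filesystem):

```python
import os


def save_state_atomically(path, data):
    """Persist bytes so a reader only ever sees the old content or
    the new content, never a torn write."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes actually reach the disk
    os.replace(tmp, path)     # atomic rename: the new single point of truth
```

A crash before the os.replace leaves the old state (plus a stale .tmp file); a crash after it leaves the new state. There is no in-between.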

Whatever your choice is, be sure to read the documentation about atomicity to understand what can happen. E.g. a file rename interrupted at the wrong point in time can look like the creation of a hard link.

Note that just killing a process is not the same as cutting the power, because the OS keeps running and closes files, flushes buffers, etc. For example, when using SQLite I rarely see journal files left behind after just killing the application, but I see them quite often after cutting the power.

SebDieBln