5

I have a device which has a library. Some of its functions are most awesomely ill-behaved, in the "occasionally hang forever" sense.

I have a program which uses this device. If/when it hangs, I need to be able to recover gracefully and reset it. The offending calls should return within milliseconds and are being called in a loop many many times per second.

My first question is: when a thread running the recalcitrant function hangs, what do I do? Even if I litter the thread with interruption points, this happens:

boost::this_thread::interruption_point(); // irrelevant, in the past
deviceLibrary.thatFunction(); // <-- hangs here forever
boost::this_thread::interruption_point(); // never gets here!

The only word I've read on what to do there is to modify the function itself, but that's out of the question for a variety of reasons -- not least of which is "this is already miles outside of my skill set".

I have tried asynchronous launching with C++11 futures:

// this was in a looping thread -- it does not work: wait_for sometimes never returns
std::future<void> future = std::async(std::launch::async, 
    [this] () { deviceLibrary.thatFunction(*data_ptr); }); 
if (future.wait_for(std::chrono::seconds(timeout)) == std::future_status::timeout) { 
    printf("no one will ever read this\n"); 
    deviceLibrary.reset(); // this would work if it ever got here
}

No dice, in that or a number of variations.

I am now trying boost::asio with a thread_group of a number of worker threads running io_service::run(). It works magnificently until the second time it times out. Then I've run out of threads, because each hanging thread eats up one of my thread_group and it never comes back ever.

My latest idea is to call work_threads.create_thread to make a new thread to replace the now-hanging one. So my second question is: if this is a viable way of dealing with this, how should I cope with the slowly amassing group of hung threads? How do I remove them? Is it fine to leave them there?

Incidentally, I should mention that there is in fact a version of deviceLibrary.thatFunction() that has a timeout. It doesn't.

I found this answer, but it's C#- and Windows-specific, and this one, which seems relevant. But I'm not so sure about spawning hundreds of extra processes a second (edit: oh right; I could banish all the calls to one or two separate processes, if they communicate well enough and I can share the device between them. Hm...)

Pertinent background information: I'm using MSVC 2013 on Windows 7, but the code has to cross-compile for ARM on Debian with GCC 4.6 also. My level of C++ knowledge is... well... if it seems like I'm missing something obvious, I probably am.

Thanks!

MechEngineer
  • 113
  • 5

3 Answers

9

If you want to reliably kill something that's out of your control and may hang, use a separate process.

While process isolation was once considered very heavy-handed, browsers such as Chrome now use it on a per-tab basis. Each tab gets a process, the GUI has a process, and if the tab rendering dies it doesn't take down the whole browser.

How can Google Chrome isolate tabs into separate processes while looking like a single application?

Threads are simply not designed for letting a codebase defend itself from ill-behaved libraries. Processes are.

So define the services you need, put them all in one program that uses your flaky library, and use interprocess communication from your main app to talk to that bridge process. If the bridge times out or has a problem due to the flakiness, kill it and restart it.
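A rough sketch of that kill-and-restart loop, assuming a POSIX target (the Debian/ARM build); on Windows the same shape would need CreateProcess, pipes and TerminateProcess instead. `Bridge`, `DeviceCall` and `request` are made-up names for illustration, and the int-in/int-out call is only a stand-in for whatever the real library function takes. The child process is long-lived, so nothing is spawned per call; a new child is forked only after a timeout.

```cpp
// POSIX-only sketch. All names here are hypothetical, not part of any library.
#include <cassert>
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Stand-in for the flaky library call: takes an int, returns an int.
typedef int (*DeviceCall)(int);

class Bridge {
public:
    explicit Bridge(DeviceCall call)
        : call_(call), pid_(-1), to_child_(-1), from_child_(-1)
    {
        assert(call_ != 0);
        signal(SIGPIPE, SIG_IGN);  // writing to a dead child must not kill us
        start();
    }
    ~Bridge() { stop(); }

    // Sends one request to the child. Returns true and fills `reply` if the
    // child answered within timeout_ms. Otherwise the child is killed, a
    // fresh one is started, and false is returned.
    bool request(int arg, int timeout_ms, int& reply)
    {
        if (pid_ <= 0) start();            // an earlier start failed: retry
        if (pid_ <= 0) return false;
        if (write(to_child_, &arg, sizeof arg) == (ssize_t)sizeof arg) {
            pollfd pfd = { from_child_, POLLIN, 0 };
            if (poll(&pfd, 1, timeout_ms) == 1 &&
                read(from_child_, &reply, sizeof reply) == (ssize_t)sizeof reply)
                return true;
        }
        stop();                            // timed out, or the child died
        start();
        return false;
    }

private:
    Bridge(const Bridge&);                 // non-copyable: owns fds and a pid
    Bridge& operator=(const Bridge&);

    void start()
    {
        int req[2], rep[2];
        if (pipe(req) != 0) return;
        if (pipe(rep) != 0) { close(req[0]); close(req[1]); return; }
        pid_ = fork();
        if (pid_ == 0) {                   // child: serve requests until EOF
            close(req[1]); close(rep[0]);
            int arg;
            while (read(req[0], &arg, sizeof arg) == (ssize_t)sizeof arg) {
                int result = call_(arg);   // may hang; the parent will kill us
                if (write(rep[1], &result, sizeof result) != (ssize_t)sizeof result)
                    break;
            }
            _exit(0);
        }
        close(req[0]); close(rep[1]);      // parent keeps the other two ends
        if (pid_ < 0) { close(req[1]); close(rep[0]); return; }
        to_child_ = req[1];
        from_child_ = rep[0];
    }

    void stop()
    {
        if (pid_ <= 0) return;             // never call kill() with pid <= 0
        close(to_child_);
        close(from_child_);
        kill(pid_, SIGKILL);               // works even if the child is hung
        waitpid(pid_, 0, 0);               // reap it so no zombie is left
        pid_ = -1;
    }

    DeviceCall call_;
    pid_t pid_;
    int to_child_, from_child_;
};
```

In a real program the device would have to be opened inside the child (after the fork), and the reset call would go right after a failed `request`. Forking from a process that already has several threads running is fragile, so it is probably safest to create the `Bridge` early, or to launch the bridge as a separate executable.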

  • Yeah, this does make perfect sense. Thank you -- I had no idea interprocess communication was even a thing, to be honest; the other answers I read that suggested processes were all super Win32API specific and/or proposed synchronization through stdout or the filesystem which seemed inefficient at best. It looks like Boost.Interprocess is cross-platform, maybe I'll start there... – MechEngineer Sep 13 '14 at 21:48
1

I am only going to answer this part of your text: when a thread running the recalcitrant function hangs, what do I do?

A thread could invoke inline machine instructions, and those instructions might clear the interrupt flag, making the code non-interruptible. As long as the code does not decide to return, you cannot force it to return. You might be able to force it to die (e.g. by killing the process that contains the thread), but you cannot force the code to return.
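To make that concrete, here is a minimal sketch assuming C++11 `<thread>` and `<future>`; `call_with_timeout` is a hypothetical helper, not something from Boost or the standard library. It lets the caller stop waiting, but that is all it does: the worker thread is abandoned, not stopped, and a call that truly hangs will hold that thread (and whatever the library has locked) until the process exits.

```cpp
#include <cassert>
#include <chrono>
#include <exception>
#include <functional>
#include <future>
#include <memory>
#include <thread>

// Runs `call` on a detached thread and waits up to `timeout` for it.
// Returns true if the call finished in time. On timeout the thread keeps
// running (or hanging) in the background; nothing here can stop it.
bool call_with_timeout(std::function<void()> call,
                       std::chrono::milliseconds timeout)
{
    assert(static_cast<bool>(call));  // an empty std::function would throw

    // The promise is shared so it outlives this function if we time out.
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> finished = done->get_future();

    std::thread([done, call]() {
        try {
            call();
            done->set_value();
        } catch (...) {
            done->set_exception(std::current_exception());
        }
    }).detach();

    // Unlike a future returned by std::async, a future obtained from a
    // promise does not block in its destructor, so returning here is safe.
    return finished.wait_for(timeout) == std::future_status::ready;
}
```

Every timeout leaks one thread, so in a loop that runs many times per second this only postpones the problem.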

I hope my answer convinces you that the answer recommending to use a bridge process is in fact what you should do.

user2587106
  • 315
  • 1
  • 5
  • Those machine instructions need not be inline. It would just be machine instructions anyways. – sehe Sep 13 '14 at 22:26
-3

The first thing you do is make sure that it's the library that's buggy. Then you create a minimal example that demonstrates the problem (if possible), and send a bug report and the example to the library's developer. Lastly, you cross your fingers and wait.

What you don't do is put your fingers in your ears and say "LALALALALA" while you hide the problem behind layers of crud in an attempt to pretend the problem is gone.

Brendan
  • 35,656
  • 2
  • 39
  • 66
  • That's all very well in an ideal world, but it doesn't address the real-world problem. Many modern systems have process-based sandboxing to deal with exactly this issue (the other answer gives Chrome as an example). – Oliver Charlesworth Sep 13 '14 at 11:43
  • Oh, I recognize that this is the equivalent of (in my industry) "just splint the damn thing with a plate and a secondary op" when it's really that the part needs to be redesigned and refabricated. No one's trying to pretend the problem is gone. But I'd be waiting an awfully long time with crossed fingers if I were to rely on, say, trying to get support on an obscure issue in a proprietary library for a device from a company that no longer exists. So I'm pretty grateful for the suggestions on how best to splint this. – MechEngineer Sep 13 '14 at 22:01
  • @MechEngineer: Is replacing the device an option? If not, can you just write your own library to talk to the device driver (and/or reverse engineer the library)? – Brendan Sep 15 '14 at 02:24