4

I'm developing a Qt 6.4.0 application that runs on Ubuntu 22.04 (Jammy Jellyfish), and it uses Qt OPC UA to connect to several PLCs on the network. After running some hours, my application crashes. Long story short, I discovered it happens when a PLC goes offline. To be sure, I simulate a host going offline with:

sudo ufw reject out to 192.168.1.110
sudo ufw reject in from 192.168.1.110

And I can replicate the crash. Inspecting the issue with a debugger, I found out the problem is at line 42096 of open62541.c:

if(client->channel.state != UA_SECURECHANNELSTATE_OPEN) {
    UA_LOG_INFO(&client->config.logger, UA_LOGCATEGORY_CLIENT,
                "SecureChannel must be connected before sending requests");
    return UA_STATUSCODE_BADSERVERNOTCONNECTED;
}

I'm puzzled they didn't check for the value of the pointer (in fact it's 0x00 as you can see in the image). This happens when I try to read anything from the remote host. If I disable the readings, I can correctly detect the host is down and reconnect to it after removing the rules.

Here how I read the values:

void MyOpcUa::readAll()
{
    QMapIterator<QString, Node_t> i(_mapOpcNodes);
    while (i.hasNext())
    {
        i.next();
        Node_t node = i.value();
        if (node.enableRead)
        {
            node.updated = false;
            node.node->readValueAttribute();
        }
    }
}

void MyOpcUa::insertNode(QString key, QString id, bool enableRead)
{
    Node_t node;
    node.node = _opcUaClient->node(id);
    node.value = QVariant();
    node.updated = false;
    node.enableRead = enableRead;
    _mapOpcNodes.insert(key, node);
    connect(node.node, &QOpcUaNode::attributeRead, this, &MyOpcUa::handleAttributes);
}

I call readAll() if the host seems online. I rely on the &QOpcUaClient::disconnected signal. But this is emitted some time after.

Since the readValueAttribute() function is asynchronous (it fires the attributeRead signal after some time), the error happens outside my code—between the call of the readValueAttribute() and the slot.

How should I prevent such a crash?

Rabbid76
  • 202,892
  • 27
  • 131
  • 174
Mark
  • 4,338
  • 7
  • 58
  • 120
  • 4
    [What is a debugger and how can it help me diagnose problems?](https://stackoverflow.com/questions/25385173/what-is-a-debugger-and-how-can-it-help-me-diagnose-problems) – Jesper Juhl Jun 14 '23 at 10:09
  • 2
    Probably `client` points nowhere. What did the debugger tell you? – Jabberwocky Jun 14 '23 at 10:11
  • @Jabberwocky is in the image: `client` is `nullptr` (0x00) – Mark Jun 14 '23 at 10:25
  • 1
    @JesperJuhl not sure about your hint. The debugger clearly says the pointer is null. But I cannot change that code, it's not mine. I'm kindly asking how to catch that exception that is outside my code. – Mark Jun 14 '23 at 10:26
  • 9
    You cannot catch this exception. This is a crash. A bug. The program's finished. Kaput. It's met its maker. It's pining for the fjords. It is an ex-program. The only realistic course of action is to fix the bug. How certain are you that it is not your code. What proof do you have of that? Just because the code crashed in a library doesn't mean that's where the problem is. C++ does not work this way. The problem can be in your code, that uses this library, that's triggering the bug. I can easily write code that crashes `strcpy`. There's nothing wrong with `strcpy`. – Sam Varshavchik Jun 14 '23 at 10:40
  • @SamVarshavchik thanks for your kindness. I assume the error is not in my code since I showed how I call the functions. If there is an error I'm glad to understand what is my fault. But as far as I can see, I call `readValueAttribute()` then my code returns to the event loop waiting for the signal `attributeRead`. And the crash happens in between. – Mark Jun 14 '23 at 11:02
  • If I'm wrong please explain me where, or ask for further details. – Mark Jun 14 '23 at 11:02
  • Putting it in another way: how to begin a read knowing a host may go offline during the reading? – Mark Jun 14 '23 at 11:05
  • Even though there might've not be any obvious issues with the shown code, that proves nothing. There's plenty more code that's not shown, where a bug may exist, that eventually results in the crash. C++ is hard. Only a portion of learning C++ involves learning its syntax. Learning safe programming practices is even more important. For example, what is this mysterious `_opcUaClient` pointer is all about? Can its node() method return a nullptr. What happens when the end result gets passed to the equally mysterious connect(). Many more problems are possible. – Sam Varshavchik Jun 14 '23 at 11:16
  • 1
    @SamVarshavchik [QOpcUaClient](https://doc.qt.io/qt-6/qopcuaclient.html) and [QObject::connect()](https://doc.qt.io/qt-6/qobject.html#connect). No mysteries... they are standard functions of the Qt framework. – Mark Jun 14 '23 at 11:26
  • 2
    I'm aware safe programming is hard... for this reason I'm asking help – Mark Jun 14 '23 at 11:26
  • Of course the bug can be in other part of my code, but just commenting out `node.node->readValueAttribute();` leads to work even simulating the host going offline. I can be wrong of course, but this put me on this way – Mark Jun 14 '23 at 11:29
  • Have you tried a static analysis tool, like valgrind. – Sam Varshavchik Jun 14 '23 at 11:31
  • @SamVarshavchik I'm going to study it and I'll report back when I learned how to use it. But can you suggest how to read an OPC-UA variable when the host goes offline before the signal is fired? – Mark Jun 14 '23 at 11:38
  • `if(!client)return nullptr; if(client->channel.state != UA_SECURECHANNELSTATE_OPEN) { UA_LOG_INFO(&client->config.logger, UA_LOGCATEGORY_CLIENT, "SecureChannel must be connected before sending requests"); return UA_STATUSCODE_BADSERVERNOTCONNECTED; }` – Yanis600 Jun 14 '23 at 18:53
  • 1
    @Yanis600 I cannot hack the open62541 library. I have to live with their code and change my own to avoid the crash. – Mark Jun 15 '23 at 06:10
  • 8
    The question is [discussed on meta](https://meta.stackoverflow.com/questions/425154/question-closed-was-my-explanation-poor-or-is-it-an-xy-problem) – BDL Jun 15 '23 at 08:59
  • 6
    Please review *[Why not upload images of code/errors when asking a question?](https://meta.stackoverflow.com/questions/285551/)* (e.g., *"Images should only be used to illustrate problems that* ***can't be made clear in any other way,*** *such as to provide screenshots of a user interface."*) and [do the right thing](https://stackoverflow.com/posts/76472333/edit). (But *** *** *** *** *** *** *** *** *** ***[without](https://meta.stackexchange.com/a/131011)*** *** *** *** *** *** *** *** *** *** "Edit:", "Update:", or similar - the question should appear as if it was written right now.) – Peter Mortensen Jun 15 '23 at 09:44
  • 1
    Submit a bug report (and if possible, a patch) to Qt. – n. m. could be an AI Jun 15 '23 at 10:00
  • 6
    @Mark `I cannot hack the open62541 library. I have to live with their code and change my own to avoid the crash.` This is false, [open62541 is an open-source project](https://github.com/open62541/open62541) which means you are freely available to modify it. Ideally you would open an issue for this on their issue tracker, submit a patch that fixes the issue (which would probably be similar to what Yanis600 posted), allow them to merge it in, then use the build with your patch in it. – Ian Kemp Jun 15 '23 at 10:58
  • @Mark you have to check if client is not null if so, you must stop the execution or something else you want to do. – Yanis600 Jun 15 '23 at 12:03
  • 17
    Stunned that no one asked you for a [mre] (MRE) yet. If this is a bug in the library, you'll need a MRE that demonstrates it to file a useful bug report. If it's not a bug in the library, making the MRE usually makes that pretty clear fairly quickly. Often you don't even get more than a few divide-and-conquer passes through the construction of the MRE before you've reduced the noise around the mistake enough to spot and fix the mistake. And if you've done the MRE and still stuck you have an example that a domain expert can experiment with and report back. When you make a MRE everybody wins. – user4581301 Jun 15 '23 at 16:57
  • 1
    I don't agree with Yanis's "fix". I think you should follow up the call stack to whatever orchestrator function first copies that pointer out of whatever data structure holds it, and starts dispatching "member-ish functions" through a NULL pointer. (They are C-style OOP methods, not using the C++ syntactic sugar. But still, the responsibility lies with the caller to not pass a null object instance to an OOP method.) Null-testing in leaf functions would just move the crash to the next method called by the naughty orchestrator. – Ben Voigt Jun 15 '23 at 20:40
  • So you've successfully used the debugger to find where the crash is occurring, and what it is (a nullptr when a valid pointer is expected). Since it is happening in library code, you now need to figure out how it is getting there. Most likely the cause is passing a nullptr to some library API that doesn't expect it. You can likely find that by looking through the backtrace (functions on the call stack) to see where `client` is coming from. – Chris Dodd Jun 16 '23 at 01:20
  • Ultimately, you'll probably find that to fix this, your `readAll` method needs to call some method on `node.node` to check if it is "readable" before calling `node.node->readValueAttribute()`. There may also be some synchronization required to avoid race conditions between the check and the call. – Chris Dodd Jun 16 '23 at 01:26
  • 1
    @user4581301 It's a bit of an edge case though since it (likely) happens after a few hours of running. I wouldn't exactly call that reproducible, first something would need to be found to trigger this quicker. – Gimby Jun 16 '23 at 13:29
  • 2
    @Gimby, please read again. It happened after hours in a real case. But after debugging I was able to replicate any time with the given simulation (`ufw` commands). – Mark Jun 16 '23 at 14:06
  • 1
    [Please do not upload images of code/data/errors.](//meta.stackoverflow.com/q/285551) – Rabbid76 Jun 17 '23 at 07:56

0 Answers0