STM32 interrupt-driven UART reception fails after several flawless receives

Question

Please note the clarification and update at the end of the post

TL;DR: An STM32 has 3 UART connections: 1 for debugging and 2 for actual communication, both of which use the interrupt-driven HAL_UART_Receive_IT. Initially, interrupt-driven UART receive works fine, but over time the receive callback for one of the UARTs fires less and less often until eventually the STM32 no longer receives any packets on that UART at all (even though I can verify that they were sent). I suspect the issue is timing related.

Situation: As part of my thesis, I developed a novel protocol which now has to be implemented and tested. It involves two classes of actors, a server and devices. A device consists of an STM32, ESP32 and a UART to Ethernet bridge. The STM32 is connected via UART to the bridge and via UART to the ESP32. The bridge connects the STM32 to the server by converting serial data sent by the STM32 to TCP packets which it forwards to the server (and vice versa). The ESP32 receives framed packets from the STM32, broadcasts them via BLE and forwards all received and well-formed BLE packets to the STM32. I.e. the ESP32 is just a BLE bridge. The server and ESP32 seem to be working flawlessly.

In a nutshell, the server tries to find out which devices D_j can hear BLE advertisements from device D_i. The server does that by periodically iterating over all devices D_1, ..., D_n and sending them nonces Y_1, ..., Y_n encrypted as X_1, ..., X_n. When D_i receives X_i, it decrypts it to get Y_i, which it then forwards to the ESP32 to be broadcast via BLE. Conversely, whenever the STM32 receives a packet from the ESP32 (i.e. a packet broadcast via BLE), it extracts some data, encrypts it and forwards it to the server.

After the server has iterated over all devices, it looks at all the messages it received during that round. If it e.g. received a message with value Y_i sent by D_j, it can deduce that D_i's broadcast somehow arrived at D_j.
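
The deduction step described above can be sketched as follows. This is only a minimal illustration and not the server's actual code, which is not shown in the post: `report_t`, `build_hearing_matrix` and the bound `MAX_DEVICES` are hypothetical names, and nonces are modeled as plain 32-bit integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DEVICES 8 /* hypothetical upper bound on n */

/* One message received by the server during a round (hypothetical layout):
 * device `reporter` says it heard a BLE broadcast carrying `nonce`. */
typedef struct {
    size_t   reporter; /* index j of the reporting device D_j */
    uint32_t nonce;    /* the value Y_i that D_j claims to have heard */
} report_t;

/* Set heard[i][j] = true iff some report from D_j carries D_i's nonce Y_i.
 * nonces[i] is the nonce Y_i the server issued to D_i in this round. */
void build_hearing_matrix(const uint32_t *nonces, size_t n,
                          const report_t *reports, size_t n_reports,
                          bool heard[MAX_DEVICES][MAX_DEVICES])
{
    assert(n <= MAX_DEVICES);
    memset(heard, 0, sizeof(bool) * MAX_DEVICES * MAX_DEVICES);

    for (size_t r = 0; r < n_reports; r++) {
        if (reports[r].reporter >= n) {
            continue; /* ignore reports from unknown devices */
        }
        for (size_t i = 0; i < n; i++) {
            if (reports[r].nonce == nonces[i]) {
                heard[i][reports[r].reporter] = true;
            }
        }
    }
}
```

A missing entry in this matrix is exactly what a dropped UART packet on the STM32 side looks like from the server's point of view, which is why the receive problem below shows up as "missed" messages.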

Problem: The way I have it set up right now, each STM32 seems to occasionally "miss" messages sent by the ESP32. The more such devices I have in my setup, the worse it gets! With just two devices, the protocol works 100% of the time. With three devices, it also seems to work fine. However, with four devices the STM32's UART receive callback for the ESP32 works fine initially, but after a couple of rounds it no longer triggers for every packet, until eventually it doesn't trigger at all.

Visualization: The below picture shows a sample topology of n devices. Not drawn here, but if e.g. D_1 was to receive Y_2, it would encrypt it to X_2' and send it across the bridge to the server.

N.B.:

Encryption and Decryption each take ca. 130ms
Average one-way delay from one ESP32 receiving a packet and broadcasting it to another ESP32 receiving it is ca. 15ms
I am aware that UART is not a reliable protocol per se and that one should use framing in a real setting. Nevertheless, I was instructed to just assume that UART is perfect and doesn't drop anything.
Due to the larger scope of the project, using an RTOS is not an option

Code:

#define LEN_SERVER_FRAMED_PACKET 35
#define LEN_BLE_PACKET           24

volatile bool_t new_server_msg;
volatile bool_t new_ble_msg;

byte_t s_rx_framed_buf[LEN_SERVER_FRAMED_PACKET];   // Receive buffer to be used in all subsequent Server receive operations
ble_packet_t ble_rx_struct;        // A struct. The whole struct is then interpreted as uint8_t ptr. when being sent to the ESP32 over UART

Init:

< set up some stuff>
err = HAL_UART_Receive_IT(&SERVER_UART, s_rx_framed_buf, LEN_SERVER_FRAMED_PACKET);
if (!check_success_hal("Init, setting Server ISR", __LINE__, err)){
    print_string("Init after Signup: Was NOT able to set SERVER_UART ISR");
}else{
    print_string("Init after Signup: Was able to set SERVER_UART ISR");

}
err = HAL_UART_Receive_IT(&BLE_UART, (uint8_t *)&ble_rx_struct, LEN_BLE_PACKET); // HAL expects a uint8_t pointer, so cast the struct pointer explicitly
if(!check_success_hal("Init, setting BLE ISR", __LINE__, err)){
    print_string("Init after Signup: Was NOT able to set BLE_UART ISR");
}else{
    print_string("Init after Signup: Was able to set BLE_UART ISR");

}

Main loop:

while (1)
{

    // (2) Go over all 3 cases: New local alert, new BLE message and new Server message and handle them accordingly

    // (2.1) Check whether a new local alert has come in
    if (<something irrelevant happens>)
    {
        <do something irrelevant>
    }

    // (2.2) Check for new ble packet. Technically it checks for packets from the UART to the ESP32.
    if (new_ble_msg)
    {
        new_ble_msg = FALSE;
        int ble_rx_type_code = ble_parse_packet(&ble_rx_nonce, &ble_rx_struct);
        HAL_UART_Receive_IT(&BLE_UART, (uint8_t *)&ble_rx_struct, LEN_BLE_PACKET); // Listen for new BLE messages.
        <compute some stuff, rather quick>
        server_tx_encrypted(<stuff computed>, &c_write, "BLE", __LINE__); // Encrypts <stuff computed> and sends it to the server using a BLOCKING HAL_UART_Transmit(...).
                                                                          // Encryption takes ca. 130ms.
    }

    // (2.3) Check for new server packet
    if (new_server_msg)
    {
        new_server_msg = FALSE;                                             // Set flag to false
        memcpy(s_wx_framed_buf, s_rx_framed_buf, LEN_SERVER_FRAMED_PACKET); // Copy from framed receive buffer to framed working buffer.
                                                                            // This is done such that we can process the current message while also being able to receive new messages

        HAL_UART_Receive_IT(&SERVER_UART, s_rx_framed_buf, LEN_SERVER_FRAMED_PACKET); // Listen for new server messages.

        <decrypt it, takes ca.130 - 150ms. results in buffer ble_tx_struct>

        err = HAL_UART_Transmit(&BLE_UART, ble_tx_struct,
                                LEN_BLE_PACKET, UART_TX_TIMEOUT);
        check_success_hal(err); // If unsuccessful, print that to debug UART
    }

    /* USER CODE END WHILE */

    /* USER CODE BEGIN 3 */
}

UART receive callback function:

void HAL_UART_RxCpltCallback(UART_HandleTypeDef *huart)
{

    if (huart == &SERVER_UART)
    { // Strictly speaking one should compare huart->Instance, but this works as well...
        new_server_msg = TRUE;
        print_string("UART Callback: Server ISR happened!\r\n"); // Blocking write to debug UART. I know that this is typically considered bad form,
                                                                 // but as the callback function is only called once per receive and because that's the only way of letting me know that the callback has occurred,
                                                                 // I chose to keep the print in.
    }
    else if (huart == &BLE_UART)
    {
        new_ble_msg = TRUE;
        print_string("UART Callback: BLE ISR happened!\r\n");
    }
    else
    {
        print_string("UART Callback: ISR triggered by unknown UART_HandleTypeDef!\r\n");
    }
}
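
One way to keep the "callback happened" information without printing from interrupt context is to count events in the callback and report them from the main loop. The sketch below is plain C that runs on a host; `rx_event_t`, `on_rx_complete` and `take_new_events` are hypothetical names and not part of the HAL. On the target, `on_rx_complete` would stand in for the body of one branch of HAL_UART_RxCpltCallback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-UART event counter. `count` is written only in interrupt context,
 * `seen` only by the main loop. */
typedef struct {
    volatile uint32_t count;
    uint32_t seen;
} rx_event_t;

rx_event_t ble_rx = {0, 0};
rx_event_t server_rx = {0, 0};

/* What the receive-complete callback would do: record the event, return. */
void on_rx_complete(rx_event_t *ev)
{
    ev->count++;
}

/* Called from the main loop. Returns how many completions happened since
 * the previous call, so printing can be done outside the interrupt. */
uint32_t take_new_events(rx_event_t *ev)
{
    assert(ev != NULL);
    uint32_t now = ev->count;        /* one aligned 32-bit read */
    uint32_t fresh = now - ev->seen; /* unsigned math survives wrap-around */
    ev->seen = now;
    return fresh;
}
```

Unlike a boolean flag, a counter also reveals when two completions happened between two passes of the main loop, which is useful information when packets appear to go missing.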

What I have tried so far:

I wrote a client implementation in Go and ran it on my computer, where clients would just directly send UDP messages to each other instead of BLE. As that version functioned flawlessly even with many "devices", I am confident that the problem lies squarely with the STM32 and its STM32 <-> ESP32 UART connection.

To get it working with 3 devices, I simply removed most of the debugging statements of the STM32 and made the server wait 250ms between sending X_i to D_{i} and X_{i + 1} to D_{i + 1}. As this seems to have at least made the problem so infrequent that I haven't noticed it anymore, I reckon that the core issue is timing related.

By drawing execution traces, I have already found an inherent weakness in my approach: if an STM32 calls HAL_UART_Receive_IT(&BLE_UART, ble_rx_buf, LEN_BLE_PACKET) while the ESP32 is in the middle of transmitting a packet to the STM32 and has already sent k bytes, the STM32 will only receive LEN_BLE_PACKET - k bytes of that packet. This causes BLE_UART.RxXferCount to be wrong when the next packet is sent by the ESP32.
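
The consequence of arming the receive k bytes late can be made concrete with a small host-side calculation. This is a sketch under the assumption that a fixed-length receive simply counts bytes and has no notion of packet boundaries or idle gaps; `count_aligned_chunks` is a hypothetical helper, not part of the project.

```c
#include <assert.h>
#include <stddef.h>

#define LEN_BLE_PACKET 24

/* The sender emits packets of LEN_BLE_PACKET bytes. The receiver starts
 * listening after `k` bytes of the first packet have already passed and
 * then completes a buffer every LEN_BLE_PACKET bytes. Returns how many of
 * the first `n_chunks` receive buffers hold bytes of exactly one packet. */
size_t count_aligned_chunks(size_t k, size_t n_chunks)
{
    assert(k < LEN_BLE_PACKET);
    size_t aligned = 0;
    for (size_t c = 0; c < n_chunks; c++) {
        size_t first = k + c * LEN_BLE_PACKET;     /* first byte index in the stream */
        size_t last  = first + LEN_BLE_PACKET - 1; /* last byte index in the stream  */
        if (first / LEN_BLE_PACKET == last / LEN_BLE_PACKET) {
            aligned++;
        }
    }
    return aligned;
}
```

For k = 0 every buffer is aligned, while for any k > 0 none of them is: the offset never corrects itself, so every later buffer straddles two packets until something resynchronizes the receiver.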

On a more theoretical front, I first considered using DMA instead of interrupt-driven receive. I decided against it, however, because DMA on the STM32 doesn't use descriptor rings as on more powerful systems; it really just removes the overhead of handling LEN_BLE_PACKET (resp. LEN_SERVER_FRAMED_PACKET) interrupts.

I have of course also checked Stack Overflow, where several people seem to have experienced similar issues, e.g. "UART receive interrupt stops triggering after several hours of successful receive" and "Uart dma receive interrupt stops receiving data after several minutes".

Questions:

Given what I have described above, how is it possible for the STM32's callback of BLE_UART to simply stop triggering after some time without any apparent reason?
Does it seem plausible that the issue I raised in the last paragraph of "What I have tried so far" is actually the cause of the problem?
How can I fix this issue?

Clarification:

After the server sends a request to a device D_i, the server waits for 250ms before sending the next request to D_{i + 1}. Hence, D_i has a 250ms transmission window in which no other D_j can transmit anything. I.e. when it's D_i's turn to broadcast its nonce, the other devices simply have to receive one UART message.

As reception from the server is typically rather fast, the decryption takes 130ms and the UART transmit at a baud rate of 115200 is also quick, this window should be long enough.
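
To put numbers on "quick", the time a packet spends on the wire can be computed directly. The sketch assumes the common 8N1 format, i.e. 10 bits per character; the post does not state the frame format, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Time on the wire for one UART character, in microseconds.
 * bits_per_char = start bit + data bits + parity (if any) + stop bits,
 * i.e. 10 for 8N1 (assumed here). */
double uart_char_time_us(uint32_t baud, uint32_t bits_per_char)
{
    assert(baud > 0);
    return 1e6 * (double)bits_per_char / (double)baud;
}

/* Time on the wire for `len` characters sent back to back. */
double uart_packet_time_us(uint32_t baud, uint32_t bits_per_char, uint32_t len)
{
    return (double)len * uart_char_time_us(baud, bits_per_char);
}
```

At 115200 baud this gives about 86.8 µs per character, roughly 2.08 ms for a 24-byte BLE packet and roughly 3.04 ms for a 35-byte server packet. The transfers themselves are therefore short compared to the 250ms window; the tight constraint is that each received character has to be picked up before the next one completes.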

UPDATE:

After posting the question, I changed the ESP32 such that BLE packets are not immediately forwarded over UART to the STM32. Instead, they are enqueued and a dedicated task in the ESP32 dequeues them with a minimum 5ms delay between packets. Hence, the STM32 should now have a guaranteed 5ms between each BLE packet. This was done to reduce the burstiness (despite there not actually being any bursts due to what is mentioned in the clarification... I was just desperate). Nevertheless, this seems to have made the STM32 "survive" for longer before the UART receiver locks up.

have you created a test project whose only code is uart receive plus maybe add some debugging or are you only trying to get this to work as a fraction of a whole application? use an ftdi or other usb to (3.3v) TTL uart and use minicom or whatever to send the characters. nothing fancy, something adhoc. — old_timer, Jun 23 '21 at 01:42
if you want to debug this in general it is best to cut the problem in half and in half and in half (getting rid of stuff each pass). Likewise burn the candle from both ends, start from the application as is now and start to hack and slash things out of it. And start from the adhoc programs used to learn each peripheral and start to re-glue the application together again from nothing. As you hit each dead end take a break from it and push on a prior dead end. — old_timer, Jun 23 '21 at 01:44
Naturally for a project like this you want/need a scope, what do you see on the scope between the parts. is the esp32 5V or 3.3V? All of the basic stuff that one has to ask when starting and debugging a project like this. — old_timer, Jun 23 '21 at 01:45
have you hooked a usb/uart breakout to the esp32 to see what it is sending? (if you dont have a scope, a project like this or in general if you are going to wire things up like this, a scope becomes crucial). — old_timer, Jun 23 '21 at 01:46
It sounds very much like a UART bug related to flags or race conditions. Does it use DMA or interrupts? In case of DMA, where is the code? In case of interrupts, where is the code? That's all we need to know. — Lundin, Jun 23 '21 at 06:35
@old_timer 'an rtos would only make this worse...' make what worse, and how? A preemptive kernel typically improves I/O performance - it is the main reason they exist. — Martin James, Jun 23 '21 at 06:59
@old_timer Thanks for the suggestions! (1) It is currently a standalone application, but later on will be integrated into something bigger. That's where the "I can't use an RTOS" restriction comes from. — iMrFelix, Jun 23 '21 at 07:00
(3-4) I have hooked up an FTDI USB to serial converter in each direction between the STM32 and the ESP32. I can confirm that the ESP32 always sends the exact data it should send (even after the STM32 fails) and that when the STM32 is working, it also sends what it should. Annoyingly, when I want to send something over the serial USB/Serial converters to the STM32 (e.g. a single byte, trying to see if IRQ happens), I have to unplug the tx wire from the ESP32 as it (apparently, can't validate without a scope) keeps the line pulled high, effectively preventing me from sending... — iMrFelix, Jun 23 '21 at 07:05
(5) Like MartinJames I am also not sure why that is the case. In FreeRTOS, I would just spawn three tasks, one for each of the two UARTs and one for processing. The UART tasks would block until they read data and then send it over a queue to the processing task. — iMrFelix, Jun 23 '21 at 07:09
@Lundin It uses interrupts, the HAL interrupt code is completely unchanged, I only wrote my own callback function which can be found in my question under "UART receive callback function". I.e. the actual ISR is completely stock; only the final callback, which is called after (in the case of the UART to the ESP32) `LEN_BLE_PACKET` bytes have been sent by the ESP32, is mine. — iMrFelix, Jun 23 '21 at 07:15
If you print to a second UART from inside a callback which is in turn called from an ISR then no wonder you are missing packets. Remove the `print_string` lines. — Lundin, Jun 23 '21 at 07:54
@Lundin I agree that the blocking prints to UART within a callback are dangerous, though notice that as I stated in my update, the server gives each D_i a 250ms window within which only D_i is allowed to broadcast anything. Hence, I believe the < 10ms (blocking write timeout was set to 10ms) are not the cause of the problem. Nevertheless, I'll try removing them! — iMrFelix, Jun 23 '21 at 12:17
It's not only dangerous, it is senseless. Most commonly you get an UART rx interrupt per byte received. So lets say you use 115.2kbps. 10 bits standard UART frame. It takes 87us to receive a byte. If your ISR isn't finished in <87us and the UART hardware has no rx FIFO, you will lose data. This results in a buffer overrun error, so you can easily check if this is the cause by checking the overrun error flag. — Lundin, Jun 23 '21 at 13:17
@Lundin That is not the case, as the rx interrupt happens every byte, though callback != ISR. The actual interrupt routine in my case is `UART_RxISR_8BIT` (stm32l5xx_hal_uart.c, line 4113 in my case), which _only_ calls the callback procedure if _all_ `LEN_BLE_PACKET` bytes have been received. — iMrFelix, Jun 23 '21 at 13:50
I believe with one of the stm32 uarts if you get a single overflow at any time then all incoming is stopped. There is an override bit for that which I set as part of my uart init. but if you are not seeing any overflows then that is good. — old_timer, Jun 23 '21 at 14:39
I recommend still that you do an isolated experiment that is uart rx only, use a usb uart and manually enter x number of bytes and confirm the interrupt is only happening at that watermark level and not per byte. (dont trust library calls, dont trust docs, test it yourself) — old_timer, Jun 23 '21 at 14:40
@old_timer I did check that and the receive callback only fires upon successfully receiving `LEN_BLE_PACKET` bytes. A clear indication that the callback only happens upon receiving the desired number of bytes is that my serial feed was not bombarded with "ISR happened" messages (guess I should have mentioned that :-) ) Nevertheless, I indeed just discovered an overrun error! I found it by setting a breakpoint within the `HAL_UART_ErrorCallback` and then combing through the USART_ISR register! — iMrFelix, Jun 23 '21 at 14:45
are you sure the overrun was legit or because you used a breakpoint to stop the processing of incoming data? — old_timer, Jun 23 '21 at 14:55
Not sure if I understand why you mean it might not be legit. I had no other breakpoints set and the breakpoint is inside the `HAL_UART_ErrorCallback`. I.e. the breakpoint being reached indicates that the error has happened and the overrun-error bit in the USART_ISR register was set to 1. — iMrFelix, Jun 23 '21 at 15:03
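
The overrun found in the last comments suggests one way in which the callback can go silent. The following is a host-side model in plain C, not HAL code, and it rests on assumptions that are not confirmed in the post: that an overrun aborts the armed receive and invokes the error callback, that a pending overrun fires as soon as a receive is armed again, and that nothing re-arms the receive afterwards because the main loop only does so once `new_ble_msg` is set. All names (`uart_model_t`, `model_arm`, `model_byte`, `rearm_on_error`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t packet_len;     /* bytes per armed receive, e.g. LEN_BLE_PACKET   */
    bool   rearm_on_error; /* policy under test: re-arm from error callback  */
    bool   armed;          /* a fixed-length receive is in progress          */
    bool   unread_byte;    /* one byte is waiting in the data register       */
    bool   overrun;        /* overrun flag is pending                        */
    size_t remaining;      /* bytes still missing from the armed receive     */
    size_t completions;    /* times the receive-complete callback would fire */
    size_t errors;         /* times the error callback would fire            */
} uart_model_t;

void model_arm(uart_model_t *u);

/* Error path: the receive is aborted and the error callback runs. */
static void model_error(uart_model_t *u)
{
    u->armed = false;
    u->overrun = false;
    u->unread_byte = false; /* simplification: unread data is discarded */
    u->errors++;
    if (u->rearm_on_error) {
        model_arm(u);
    }
}

/* Arm a receive of packet_len bytes; a pending overrun fires immediately. */
void model_arm(uart_model_t *u)
{
    assert(u->packet_len > 1);
    u->armed = true;
    u->remaining = u->packet_len;
    if (u->overrun) {
        model_error(u);
    } else if (u->unread_byte) {
        u->unread_byte = false;
        u->remaining--; /* the stale byte becomes the first received byte */
    }
}

/* One byte arrives on the wire. */
void model_byte(uart_model_t *u)
{
    if (u->armed) {
        if (--u->remaining == 0) {
            u->armed = false; /* complete: the caller must arm again */
            u->completions++;
        }
    } else if (u->unread_byte) {
        u->overrun = true; /* second unread byte while nobody is listening */
    } else {
        u->unread_byte = true;
    }
}
```

Feeding the model one packet while armed, one packet while unarmed (the main loop is busy) and then arming again yields one error. With `rearm_on_error` set to false no further completion is ever counted, which matches the observed lock-up; with it set to true, completions resume. On the target, the analogous experiment would be to call HAL_UART_Receive_IT again from HAL_UART_ErrorCallback and check whether the callback keeps firing, bearing in mind that a re-armed receive can start in the middle of a packet.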

score -2 · Answer 1 · answered Aug 30 '21 at 15:45

You need to be very careful when using the STM32 HAL library in production: in my experience the library isn't reliable when receiving fast, continuous data from a server or any other source.

I will suggest a solution to this problem based on what I did when implementing a similar application. It works well for my Firmware-Over-The-Air (FOTA) project and helped to eliminate UART failures when using the STM32 HAL library.

Steps are listed below:

Ensure you reset the UART by calling MX_USARTx_UART_Init()
Reconfigure the callback either for HAL_UART_Receive_IT() or HAL_UART_Receive_DMA()

These two steps should eliminate UART receive-interrupt failures when using the STM32 HAL.

It looks to me like voodoo programming. Trial and error. Maybe one combination of function calls will work - but we do not know where the problem is. — 0___________, Aug 30 '21 at 15:51
