I am seeing message loss with NATS publish (Core NATS, not JetStream yet).
Using NATS CLI on Windows to subscribe as sub ">"
Using NATS server on Linux Ubuntu on local LAN.
Application on Windows using NATS C Client (latest GitHub version).
The following code reproduces the problem (possibly only on fast CPUs on the client side; I used AMD Threadripper CPUs with 16, 32 and 64 cores and an Intel i7-10810U, and all of them show it).
The problem already occurs with a SINGLE message, on an idle network and with a NATS server dedicated to this test, so there is no other traffic or heavy load on the NATS server.
You need to pass an already connected and authenticated connection to this function (that code is not shown because it contains my keys). Set selectTest to 1, 2 or 3 to see the different scenarios and the workaround.
#include <cstddef>
#include <cstdio>
#include <string>
#include "nats.h"

natsStatus PublishIt(natsConnection* nc) {
    // Create subject
    std::string subject = "test";
    // Fill a buffer with data to send (snprintf bounds the write to the buffer size)
    const size_t bufSize = 1024;
    char* buf = new char[bufSize];
    int len = snprintf(buf, bufSize, "%s", "This is a reliability test to see if NATS loses messages on fast systems and if possibly the provided buffer is cloned after the natsConnection_Publish() function already returned. If that is the case it would explain NATS high performance but while being unreliable depending on the underlying CPU speed and thread-lottery.");
    // Publish
    natsStatus nstat = natsConnection_Publish(nc, subject.c_str(), (const void*) buf, len);
    if (nstat != NATS_OK) { // <<< Never failed
        printf("natsConnection_Publish() Failed\n");
        delete[] buf; // don't leak the buffer on the error path
        return nstat;
    }
    // Select the test according to the remarks next to the 'case' statements.
    int selectTest = 3;
    switch (selectTest)
    {
    case 1: // This loses messages. NATS CLI doesn't display them.
        delete[] buf;
        break;
    case 2: // This is a memory leak BUT NEVER loses any message and the above text appears on NATS CLI.
        // Will eventually run out of memory of course and isn't an acceptable solution.
        // Do nothing, just don't delete[] buf.
        break;
    case 3: // This is a workaround that doesn't lose messages and NATS CLI shows the text, BUT it costs performance.
        nstat = natsConnection_Flush(nc);
        if (nstat != NATS_OK) printf("NATS Flush Failed: %i\n", (int) nstat); // <<< Flush never failed.
        delete[] buf;
        break;
    }
    return nstat;
}
Does anyone have a better solution than the flush() above? Something tells me that on an even faster CPU, or if core dedication became possible, this workaround is not going to hold. My reasoning is that the flush() just buys enough time for some underlying asynchronous action to consume the buffer before it is deleted.
I also tried a single flush() with a 2-second timeout just before disconnecting, but that doesn't work. The flush must happen between the publish call and the deletion of the buffer. That means it must be called on EVERY SINGLE publish, which is a performance problem.
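If the reasoning above is right, one way to avoid a flush on every publish might be to keep each buffer alive until the next flush and to flush only every N messages. Below is a minimal sketch of that idea, not tested against a real NATS connection. PublishFn and FlushFn are hypothetical stand-ins for thin wrappers around natsConnection_Publish() and natsConnection_Flush(), used here so the sketch compiles without nats.h; BatchedPublisher and batchSize are names made up for illustration. Whether this actually prevents the loss depends on the unconfirmed assumption that the buffer has to outlive the publish call.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-ins so the sketch compiles without nats.h.
// Real code would wrap natsConnection_Publish() and natsConnection_Flush().
// In this sketch, 0 means success.
using PublishFn = std::function<int(const std::string& subject, const char* data, int len)>;
using FlushFn = std::function<int()>;

class BatchedPublisher {
public:
    BatchedPublisher(PublishFn publish, FlushFn flush, std::size_t batchSize)
        : publish_(std::move(publish)), flush_(std::move(flush)), batchSize_(batchSize) {}

    // Takes ownership of the payload and keeps it alive until the next
    // successful flush, instead of freeing it right after publish returns.
    int Publish(const std::string& subject, std::string payload) {
        // std::deque keeps existing elements in place on push_back, so the
        // pointers handed out for earlier payloads stay valid.
        pending_.push_back(std::move(payload));
        const std::string& p = pending_.back();
        int status = publish_(subject, p.data(), static_cast<int>(p.size()));
        if (status != 0) return status;  // payload is released at the next successful flush
        if (pending_.size() >= batchSize_) return Flush();
        return 0;
    }

    // Flushes once, then releases every payload published since the last flush.
    int Flush() {
        int status = flush_();
        if (status == 0) pending_.clear();
        return status;
    }

    std::size_t PendingCount() const { return pending_.size(); }

private:
    PublishFn publish_;
    FlushFn flush_;
    std::size_t batchSize_;
    std::deque<std::string> pending_;
};
```

With this shape, Flush() would also have to be called once before disconnecting so that the last partial batch is sent and released. The cost is memory for up to batchSize payloads in exchange for one flush per batch instead of one per message.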
The documentation at http://nats-io.github.io/nats.c/group__conn_pub_group.html#gac0b9f7759ecc39b8d77807b94254f9b4 doesn't say whether the caller must keep the buffer alive after the call or may release it right away, hence I delete it. Maybe there is other documentation, but the page above claims to be the official one.
Thanks for any additional information.