
I'm working on a distributed system with multiple workers.

Each worker is assigned a specific address - host and port.

As a worker, I'd like to know when another worker comes online and starts listening on its assigned address.

Currently I create a socket and keep trying to connect until I reach a timeout. An external function calls connectionOk in a loop.

bool MyClass::connectionOk(struct sockaddr_storage other_worker) {
    // Connect and immediately disconnect to verify that the peer is available.
    int fd = socket(other_worker.ss_family, SOCK_STREAM, 0);
    int res = connect(fd, (const sockaddr*)&other_worker, sizeof(other_worker));
    close(fd);

    return res == 0;
}

However, that doesn't seem to work reliably. A worker gets created and starts listening, but sometimes connectionOk keeps returning false and the timeout is reached.

Is there a better way to accomplish this?

Shtut
  • What is `other_worker.ss_family` equal to? Is the socket blocking? You also have to check `errno`, or your OS-specific error code source, to find out what is going on. Some errors aren't really errors; e.g. an operation can be interrupted and has to be repeated. – Swift - Friday Pie Jul 05 '21 at 11:27
  • @Swift-FridayPie Family is AF_INET. The socket is blocking. I'll try looking into errno, but right now I'm repeating the check every 10ms or so for a minute. I don't think it's something that can be fixed by trying again. – Shtut Jul 05 '21 at 11:33
  • Why do you want to know this? Once you've closed `fd`, you can't tell whether the worker is listening or not without trying to connect to it again. "It was online a while ago" isn't very useful information. – molbdnilo Jul 05 '21 at 11:34
  • @molbdnilo This is part of a rendezvous operation. It should be fairly quick, but I don't want to accidentally start the process before making sure the other worker is online. As a side note: currently, to work around this, I just wait a few seconds and that works fine, but I'd like to be smarter about this and check whether the connection is available beforehand. – Shtut Jul 05 '21 at 11:46
  • Why are you even doing this so often? `connect` may fail because no resource is available. Socket creation and the connect attempt should be separate; closing and reopening the socket only aggravates the issue. – Swift - Friday Pie Jul 05 '21 at 11:46
  • @Swift-FridayPie Since I close the connection immediately after creating it, shouldn't it free up resources fairly quickly? If I do reach the connection's resource limit, what can I do to prevent it? – Shtut Jul 05 '21 at 11:51
  • It's like a file system, but with a very slow backend, since you're using TCP with its FIN_WAIT states. "Quickly" is a relative term. Even `shutdown` doesn't always help. Any result you get might be outdated, interrupted or wrong. The application on the other side could quit or crash while the socket is still listening. Are you doing this from multiple threads? – Swift - Friday Pie Jul 05 '21 at 12:01
  • @Swift-FridayPie I'm simulating my distributed system using multiple processes (one process per worker), but every worker pair gets its own ports, so there should be no collisions. Could you suggest a better way to accomplish what I'm trying to do? – Shtut Jul 05 '21 at 12:23
  • What is your proof that your worker is not hanging or stuck somewhere, and thus unable to accept any more connections, so that the number of pending connections has reached its maximum and further connections are being rejected? In any case, without a minimal reproducible example that anyone can cut/paste, exactly as shown, to reproduce your results, it's unlikely that anyone will be able to offer anything except a random guess. – Sam Varshavchik Jul 05 '21 at 12:23
  • @SamVarshavchik Each worker gets its own port with regard to each of the other workers. So if I want to connect from A to B, there's a specific port for it; C to B will use a different port. My "proof" that it's not stuck somewhere is that if I just add a 3 second wait (without checking whether the worker is available), it works fine. I'd just like to save those 3 seconds and stop waiting as soon as the other worker is available. – Shtut Jul 05 '21 at 12:28
  • Your task reminds me of ZeroMQ. Did you consider using one worker as a publisher and all the others as subscribers? Every worker that comes online tells the publisher "I am online" and subscribes. The publisher sends a notification to each subscriber. To track whether a worker is still online, it sends heartbeat messages from time to time, and it also notifies everyone when someone goes down. – Maxim Skvortsov Jul 05 '21 at 12:51
  • @MaximSkvortsov Sadly, I can't do that. I need to implement an existing interface. One of the functions there is 'wait', which checks whether the other worker is available until the timeout is reached. This is not something I can avoid with a redesign. – Shtut Jul 05 '21 at 12:59
  • > "I'd just like to save those 3 seconds and stop waiting as soon as the other worker is available." I am afraid that in this case the only way to determine this is to try to connect. But, as already mentioned, you can't retry too frequently. Find the smallest retry interval that you can accept and that still reliably gets you a connection. – Maxim Skvortsov Jul 05 '21 at 13:10
  • @Shtut "*I'm repeating the check every 10ms or so*" - that is way too short an interval. A TCP socket has to be bound to a local port before it can connect to a remote port. Since you are not specifying a local port (which is common practice on the client side), the OS has to pick a random available ephemeral port for you. When the socket is closed, its local port is released back to the OS, but it takes time before it is ready for reuse. So, if you make a LOT of short-lived connections in a short amount of time, you can cause what is known as "port exhaustion". – Remy Lebeau Jul 05 '21 at 16:13
  • Perhaps also relevant: https://stackoverflow.com/questions/1803566/what-is-the-cost-of-many-time-wait-on-the-server-side – Swift - Friday Pie Jul 06 '21 at 08:34
  • This sounds like you want a hard real-time distributed system, something like DIS. That's a whole different realm of tasks, where common consumer tools are not enough. The point is that TCP, and sometimes even UDP, are too inertial for such tasks, and those tasks usually call for a proper OS and network transport. Where Ethernet is used, I have seen examples of RAW mode being exploited; that's the poor man's choice. – Swift - Friday Pie Jul 06 '21 at 08:43

0 Answers