39

I have a program that consists of a master server and distributed slave servers. The slave servers send status updates to the server, and if the server hasn't heard from a specific slave in a fixed period, it marks the slave as down. This is happening consistently.

From inspecting the logs, I have found that the slave is only able to send one status update to the server, and is then never able to send another: every later attempt fails on the call to connect() with "Cannot assign requested address" (errno 99).

Oddly enough, the slave is able to send several other updates to the server, and all of the connections are happening on the same port. It seems that the most common cause of this failure is that connections are left open, but I'm having trouble finding anything left open. Are there other possible explanations?

To clarify, here's how I'm connecting:

struct sockaddr *sa; // parameter
size_t           sa_size; //parameter
int              i = 1;
int              stream;

stream = socket(AF_INET,SOCK_STREAM,0);
setsockopt(stream,SOL_SOCKET,SO_REUSEADDR,&i,sizeof(i));
bindresvport(stream,NULL);
connect(stream,sa,sa_size);

This code is in a function to obtain a connection to another server, and a failure on any of those 4 calls causes the function to fail.
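One way to narrow this down is to log which of the four calls failed and with which errno, and to make sure the descriptor is closed on every failure path so that a failed attempt cannot leave a socket open. The following is only a sketch of that idea: `open_connection` and `fail` are hypothetical names, not the function from the question, and it assumes Linux with glibc, where bindresvport() is declared in <netinet/in.h>.

```c
#define _DEFAULT_SOURCE        /* exposes bindresvport() on glibc */
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: log the failing step, close the socket if one
 * was created, and restore the errno of the call that failed. */
static int fail(int fd, const char *step)
{
    int saved = errno;

    fprintf(stderr, "%s failed: %s (%d)\n", step, strerror(saved), saved);
    if (fd >= 0)
        close(fd);
    errno = saved;
    return -1;
}

/* Same four calls as in the question, with every return value checked.
 * Returns a connected descriptor, or -1 with errno set. */
int open_connection(const struct sockaddr *sa, socklen_t sa_size)
{
    int one = 1;
    int stream = socket(AF_INET, SOCK_STREAM, 0);

    if (stream < 0)
        return fail(-1, "socket");
    if (setsockopt(stream, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0)
        return fail(stream, "setsockopt");
    if (bindresvport(stream, NULL) < 0)   /* reserved ports need privileges */
        return fail(stream, "bindresvport");
    if (connect(stream, sa, sa_size) < 0)
        return fail(stream, "connect");
    return stream;
}
```

With this shape, the log line shows whether the error really comes from connect() or from an earlier call, and the caller only has to close the descriptor it gets back on success.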

dbeer

5 Answers

22

It turns out that the problem really was that the address was busy - the busyness was caused by some other problems in how we are handling network communications. Your inputs have helped me figure this out. Thank you.

EDIT: to be specific, the problems in handling our network communications were that these status updates would be constantly re-sent if the first failed. It was only a matter of time until we had every distributed slave trying to send its status update at the same time, which was over-saturating our network.

dbeer
  • 2
    I would love an elaboration on “busy” in case it is the cause of the same error over here in my own code — do you mean “the server accepting connections had too long a queue of sockets waiting for accept() for another connection to be allowed on the queue?” Or another circumstance? Thanks! – Brandon Rhodes Mar 07 '13 at 13:32
  • 6
    @BrandonRhodes our problem was that we were retrying without a proper backoff algorithm, so we had hundreds or more connection attempts to the same socket every second. This contention was causing our failure. Implementing a proper backoff algorithm was crucial to solving this problem. – dbeer Mar 07 '13 at 16:26
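The thread does not say which backoff algorithm was used. As an illustration only, here is a minimal sketch of capped exponential backoff with random jitter; the function names and the base and cap values are assumptions, not dbeer's implementation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Upper bound on the wait before retry number `attempt` (0-based):
 * base_ms doubled once per attempt, clamped to cap_ms. */
uint32_t backoff_ceiling_ms(unsigned attempt, uint32_t base_ms, uint32_t cap_ms)
{
    uint32_t delay = base_ms < cap_ms ? base_ms : cap_ms;

    while (attempt > 0 && delay < cap_ms) {
        /* Clamp before doubling so the multiplication cannot overflow. */
        delay = (delay > cap_ms / 2) ? cap_ms : delay * 2;
        attempt--;
    }
    return delay;
}

/* Actual wait: a random value in [ceiling / 2, ceiling], so that many
 * slaves that failed at the same moment do not all retry together.
 * rand() is enough for a sketch; seed it differently on each host. */
uint32_t backoff_delay_ms(unsigned attempt, uint32_t base_ms, uint32_t cap_ms)
{
    uint32_t ceiling = backoff_ceiling_ms(attempt, base_ms, cap_ms);
    uint32_t half = ceiling / 2;

    return half + (uint32_t)rand() % (ceiling - half + 1u);
}
```

A sender would sleep for `backoff_delay_ms(attempt, 100, 5000)` milliseconds after the failed attempt numbered `attempt`, and reset the counter to zero once an update gets through.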
12

Maybe SO_REUSEADDR helps here? http://www.unixguide.net/network/socketfaq/4.5.shtml

Michel
  • SO_REUSEADDR is set for all connections. – dbeer Oct 03 '11 at 21:23
  • 2
    here's one similar : http://stackoverflow.com/questions/3886506/why-would-connect-give-eaddrnotavail – dmh2000 Oct 03 '11 at 22:35
  • @dmh2000 - I looked at that example before posting and haven't had success trying to look into those factors. I'm wondering if I just need to keep looking or if there's something I'm not taking into account. – dbeer Oct 04 '11 at 17:10
  • 2
    Is that function you talk about executed multiple times? Do you close the socket before calling connect again? Can you explain the difference between "status updates" and "other updates" in your question? I'm confused why you say "...slave is only able to send one status update..." and then "...slave is able to send several other updates...". – Michel Oct 04 '11 at 18:31
  • @Michel - the connection is closed immediately after sending the update and receiving confirmation that it was received. The 'other updates' are mostly reporting on tasks that the server asked the slave to perform. The slave is able to contact the server for this kind of reporting, but not for its status update. It's perplexing. The other updates mostly write back on a socket opened from the master server, and for these updates the slave opens the connection. – dbeer Oct 04 '11 at 20:48
7

This is just a shot in the dark: when you call connect without a bind first, the system allocates your local port, and if you have multiple threads connecting and disconnecting, it could possibly try to allocate a port that is already in use. The kernel source file inet_connection_sock.c hints at this condition. Just as an experiment, try doing a bind to a local port first, making sure each bind/connect uses a different local port number.
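The experiment suggested above could look roughly like the following. This is a sketch under assumptions: `connect_from_port` is a hypothetical helper, it binds on all local addresses, and the caller is responsible for picking a different `local_port` for each connection.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind the socket to an explicit local port, then connect to dst.
 * Returns the connected descriptor, or -1 with errno set. */
int connect_from_port(uint16_t local_port,
                      const struct sockaddr *dst, socklen_t dst_len)
{
    struct sockaddr_in local;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&local, 0, sizeof local);
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);  /* any local address */
    local.sin_port = htons(local_port);         /* 0 lets the kernel choose */

    if (bind(fd, (const struct sockaddr *)&local, sizeof local) < 0 ||
        connect(fd, dst, dst_len) < 0) {
        int saved = errno;   /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

If the error disappears when every connection uses its own local port, that points at local port allocation rather than at the remote side.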

dmh2000
  • Sorry, I wasn't looking at my code when I posted that. I do call a bind before connect. I will update my question to show better what I'm doing. – dbeer Oct 04 '11 at 17:12
6

Okay, my problem wasn't the port, but the binding address. My server has an internal address (10.0.0.4) and an external address (52.175.223.XX). When I tried to open a listening socket with:

$sock = @stream_socket_server('tcp://52.175.223.XX:123', $errNo, $errStr, STREAM_SERVER_BIND|STREAM_SERVER_LISTEN);

It failed because the only address assigned to the local interface was 10.0.0.4, not the external 52.175.223.XX. You can check the available local interfaces with sudo ifconfig.
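The same check can be done from code. The answer above uses PHP; the sketch below is in C to match the question, and assumes a system that provides getifaddrs() (Linux with glibc, the BSDs). `is_local_ipv4` is a hypothetical helper name.

```c
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Returns 1 if the dotted-quad address is assigned to a local interface,
 * 0 if it is not, and -1 if the string is invalid or the lookup fails. */
int is_local_ipv4(const char *dotted)
{
    struct in_addr want;
    struct ifaddrs *list, *ifa;
    int found = 0;

    if (inet_pton(AF_INET, dotted, &want) != 1)
        return -1;
    if (getifaddrs(&list) != 0)
        return -1;

    for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
        const struct sockaddr_in *sin;

        /* Skip interfaces without an address and non-IPv4 entries. */
        if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
            continue;
        sin = (const struct sockaddr_in *)ifa->ifa_addr;
        if (sin->sin_addr.s_addr == want.s_addr) {
            found = 1;
            break;
        }
    }
    freeifaddrs(list);
    return found;
}
```

On a host like the one described, this would be expected to return 1 for the internal address and 0 for the external one, which is why binding to the external address fails with "Cannot assign requested address".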

Dallas Clarke
  • This saved me a ton of time - thanks @Dallas Clarke! Same issue/solution applies to AWS EC2 instances. – Stan S. Mar 09 '22 at 16:31
-18
sysctl -w net.ipv4.tcp_timestamps=1
sysctl -w net.ipv4.tcp_tw_recycle=1
Soli