
I have written a pretty basic application in C# (.NET Compact Framework 2.0) using UDP sockets.

The program works fine for a while (up to a couple of weeks at a time), but always fails eventually. Besides my clients being unable to reconnect, this bug seems to kill all activity on the associated NIC. Once this happens, I can no longer remote into the device (using CE Remote Display), which is my only means of getting additional feedback for debugging. So at this point, I am not 100% certain whether the application itself crashes or whether I am breaking something within the operating system via my socket code.

I have implemented an unhandled exception event that never gets raised. I also have a number of try/catch blocks that would output the exception message to a text file. I am not seeing any exceptions being thrown.

/// Removed old TCP code.

The clients themselves are simple little gateway devices that are configured as UDP servers. This is a remote system that I have access to sparingly, and although I have a test controller and gateway unit, the conditions are not identical and I have not yet been able to reproduce the issue on my end.

TIA for any feedback.

Edit:

I've been running with my test bench demo and periodically checking netstat on the server per some comment suggestions. In CE5 netstat does not take the -a flag so I've been using -n (not sure if this is going to tell me what I need...). I have been disconnecting and reconnecting my clients several times, forcing half-opens by unplugging Ethernet, etc. and the netstat table is only showing one connection per client (at the appropriate ports).

Edit 2:

Due to the sparse nature of the messaging during production, I changed the application over to connectionless UDP messaging, but I am still experiencing the same behavior (with about the same amount of time to failure). On my test hardware, the application runs successfully indefinitely with a high rate of messages (once every few seconds). However, in production where messages would be a lot less frequent, the program fails after running for about 10 days. I wouldn't think inactivity would matter, but perhaps I've got that wrong? Looking for any suggestions I can get.

New Send/Receive code:

    public void Send(string Message)
    {
        Socket udpClient = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);
        EndPoint ep = new IPEndPoint(IPAddress.Parse(_ipAddress), _port);

        udpClient.Connect(ep);

        byte[] data = Encoding.ASCII.GetBytes(Message);
        // async send, sync receive
        udpClient.BeginSendTo(data, 0, data.Length, SocketFlags.None, ep, (ar) =>
        {
            try
            {
                udpClient.EndSendTo(ar);
                _lastSent = Message;

                string msg = this.ReceiveSync(udpClient, 3);
                if (!string.IsNullOrEmpty(msg))
                {
                    _lastReceived = msg;
                    DataReceived(new ReceiveDataEvent(_lastReceived));
                }
            }
            catch { } // NOTE: any exception (e.g. SocketException) is silently swallowed here
            finally
            {
                udpClient.Close();
            }

        }, null);
    }

    private string ReceiveSync(Socket UdpClient, int TimeoutSec)
    {
        string msg = "";
        byte[] recBuffer = new byte[256];

        int elapsed = 0;
        bool terminate = false;
        do
        {
            // check for data avail every 500ms until TimeoutSecs elapsed
            if (UdpClient.Available > 0)
            {
                int bytesRead = UdpClient.Receive(recBuffer, 0, recBuffer.Length, SocketFlags.None);
                // decode only the bytes actually received, not the whole buffer
                msg = Encoding.ASCII.GetString(recBuffer, 0, bytesRead);
                terminate = true;
            }
            else
            {
                if ((elapsed / 2) == TimeoutSec) 
                    terminate = true;
                else
                {
                    elapsed++;
                    System.Threading.Thread.Sleep(500);
                }
            }
        } while (!terminate);

        return msg;
    }
bstiffler582
  • Do you do exception handling in your async callbacks? – Oguz Ozgul Apr 06 '20 at 21:37
  • Could it be that you are eating up all the available ports (by reconnecting without disconnecting)? Because "kill all activity on the NIC" is unheard of to me. – Oguz Ozgul Apr 06 '20 at 21:39
  • A known issue with TCP since it was first developed in the 1970s. I believe you have a connection that is half open and half closed. You can verify from cmd.exe >Netstat -a (run on both client and server). Your connection did not fully close, so you cannot make a new connection. The issue is that when you close a connection, TCP requires an ACK, and the connection doesn't close unless you get an ACK. When you simultaneously close a connection from both the client and server, one gets the ACK and the other doesn't. Always close a connection only from the client. Remove the close from the server. – jdweng Apr 06 '20 at 21:50
  • @jdweng I can see that being a problem with a particular client, but I don’t understand how that can cause the controller to stop receiving connections altogether (on that particular NIC). I do appreciate the info though - I will revise. – bstiffler582 Apr 06 '20 at 21:57
  • @Oguz Ozgul I do have the callback bodies wrapped in try/catch. For your second comment - maybe, but I would expect to be able to easily duplicate that scenario if that were the case. On my test controller I am able to connect/disconnect from multiple clients several times without any issue. In the production environment this should seldom be happening. – bstiffler582 Apr 06 '20 at 21:59
  • Did you use Netstat -a to check the status of the connection? As I said the connection didn't close which will prevent you from opening a 2nd connection using same source IP, Destination IP , and port number. – jdweng Apr 06 '20 at 22:02
  • @jdweng when this problem occurs, the controller stops accepting connections from both my clients (on different IPs) as well as from the remote display tool (different application altogether, different IP and port). It’s as if my tool is affecting the OS. – bstiffler582 Apr 06 '20 at 22:07
  • it is not the port used by the remote display tool. The OS cannot find any available ports to ACCEPT the connection request. When a TCP connect request is made, an available port is needed to accept it. That's why I asked if this can be the issue. Please check on the clients (and if possible, on the server) with netstat tool to see if there are any connections in CLOSE_WAIT / LAST_ACK / UNKNOWN state. – Oguz Ozgul Apr 06 '20 at 22:19
  • @Oguz Ozgul my clients are little headless embedded devices. I would be able to run netstat on the server (CE5), however once this issue occurs I have no way of accessing it. There is no display port, so my only access to the server is via the remote display tool. – bstiffler582 Apr 06 '20 at 22:34
  • Ok, I understand that. But if this happens rarely, the warning signs of the coming failure must already be there. Eating up tens of thousands of ports must take time. – Oguz Ozgul Apr 06 '20 at 22:37
  • @Oguz Ozgul I see - I will run netstat after the application is running for some time and see if that’s the case. Thanks for your feedback! – bstiffler582 Apr 06 '20 at 22:39
  • You're welcome. Please stay home and be safe. – Oguz Ozgul Apr 06 '20 at 22:42
  • You can only have one connection with the same Source IP, Destination IP, and Port Number. When a connection doesn't fully close you can't make another connection with the same three properties. So the connection is blocked until the connection closes. Netstat will show the status in the TCP section with the port number. Netstat should show one of: Nothing, Listening, Connected. If you get something else, it means that the connection is not closed. – jdweng Apr 06 '20 at 23:07
  • @jdweng Yes I do understand that. What I still haven’t been able to figure out is how my application is affecting unrelated network applications and devices. If the only problem was that the server in my program stopped accepting the gateway clients after some time, I wouldn’t be as concerned. What I want to figure out is why it breaks all network connectivity to that NIC. – bstiffler582 Apr 06 '20 at 23:16
  • The NIC is a server (a socket). So the entire listener is locked up. If you went to machine where NIC is running and used Netstat -a you will find it half closed/half opened. The entire port number is locked up. – jdweng Apr 06 '20 at 23:30
  • Would netstat -n be sufficient? Windows CE 5 does not have netstat -a... – bstiffler582 Apr 07 '20 at 00:31
  • I have updated the question and added a bounty @OguzOzgul – bstiffler582 May 01 '20 at 18:29
  • You have completely changed the question, from dealing with TCP to UDP. You should have closed this question and started a new one. You have invalidated the answers and comments present. – tcarvin May 05 '20 at 11:53
  • @tcarvin I have changed the protocol, but the underlying issue is identical. After X amount of time the network interface stops responding. If changing to UDP also changed the behavior of the system, then I would have posted as a new question. Given that the question title does not specify a protocol (and evidently the protocol is less relevant to the problem), I don’t see an issue with the edit. I can put the old TCP code back in if it would make you feel better... I just wanted to keep the question body short as to avoid discouraging viewers from reading. – bstiffler582 May 06 '20 at 13:44

1 Answer


You are probably running out of sockets on the server (Windows CE 5, a 32-bit OS). See the similar question "Is there a limit on number of tcp/ip connections between machines on linux?": "...Once a TCP socket is closed (by default) the port remains occupied in TIMED_WAIT status for 2 minutes..."

I am missing information on how many connections the clients create and close per unit of time. You probably have to think about the socket option SO_REUSEADDR (https://learn.microsoft.com/en-us/previous-versions/windows/embedded/ms884940%28v%3dmsdn.10%29).
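
As a rough illustration of the SO_REUSEADDR suggestion, here is a minimal sketch of setting that option on a UDP socket before binding it. The class and method names (ReuseAddressSketch, CreateBoundUdpSocket) and the localPort parameter are made up for this example, and I have not tested it on CE5, so treat it as a starting point only:

```csharp
using System.Net;
using System.Net.Sockets;

static class ReuseAddressSketch
{
    // Hypothetical helper (not from the question's code): creates a UDP
    // socket with SO_REUSEADDR enabled and binds it to the given local port.
    public static Socket CreateBoundUdpSocket(int localPort)
    {
        Socket socket = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);

        // The option has to be set before Bind for it to affect the bind.
        // The int overload (1 = enabled) is used here as the most widely available one.
        socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.ReuseAddress, 1);

        // Port 0 lets the OS pick a free port; pass a fixed port to reuse it.
        socket.Bind(new IPEndPoint(IPAddress.Any, localPort));
        return socket;
    }
}
```

The caller still has to Close() the returned socket. Whether this option changes anything for your failure is an open question; it only addresses the case where a local address/port cannot be bound again.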

You could run a circular network trace in the server's subnet (30 minutes or so, just long enough to capture what leads up to the 'crash'; the length depends on how fast you can stop the trace afterwards) to see what happens just before the 'crash'.

Another thought is to reboot the server periodically (once a night), as Windows Mobile/CE devices generally do not run well 24/7.

Our customers use a lot of Windows Embedded Handheld 6.5 (CE5 based) devices. Even if they do not do much networking, these devices run most stably over the day if they are rebooted every night. A periodic reboot would also reveal a faulty NIC driver on the CE5 server (who knows, some companies do not do well with platform software). Or try another vendor's NIC.

BTW: I have written my own netstat for Windows Mobile: http://www.hjgode.de/wp/2013/09/24/mobile-development-netstat-know-your-devices-open-ports/. I did not test it on Windows CE5, but it should work or can be made to work on CE5 too.

josef
  • Hey josef, thanks for the response. There are only two clients, and they are configured to use a persistent TCP connection. So, I would only expect to run out of sockets if the connection were very frequently being interrupted or lost. That is not out of the question as this is an industrial environment, but still unexpected to this degree. – bstiffler582 Apr 08 '20 at 12:14
  • Only a network trace may verify that the clients use the same connection all the time. – josef Apr 08 '20 at 17:53
  • I changed the application over to UDP and am experiencing the same behavior, with about the same amount of time to failure. In my testing, I was able to watch (using netstat -n) the connection get created and assigned a port. The ports would increment each message all the way up to 65535 and then simply roll over, continuing to run successfully. However, in production where messages would be a lot less frequent, the application still fails after running for ~10 days. Any ideas? – bstiffler582 May 01 '20 at 18:18
  • Possibly the clients do more strange things than your test environment. Best is to go with a network trace. – josef May 10 '20 at 17:42