I have a problem that I suspect is caused by my code.

The application is used to 'ping' some custom made network devices to check if they're alive. It pings them every 20 seconds with a special UDP packet and expects a response. If they fail to answer 3 consecutive pings the application sends a warning message to the staff.

The application runs 24/7, and a few times a day (mostly 2-5) it fails to receive UDP packets for exactly 10 minutes, after which everything goes back to normal. During those 10 minutes only one device seems to be replying; the others seem dead. I've been able to deduce that from the logs.

I've used Wireshark to sniff the packets and verified that ping packets are going both out AND in, so the network part seems to be working okay, all the way to the OS. The computers are running Windows XP Pro and some have no firewall configured whatsoever. I'm having this issue on different computers, different Windows installs and different networks.

I'm really at a loss as to what might be the problem here.

I'm attaching the relevant part of the code, which does all the networking. It runs in a separate thread from the rest of the application.

I thank you in advance for whatever insight you might provide.

def monitor(self):
    checkTimer = time()
    while self.running:
        read, write, error = select.select([self.commSocket],[self.commSocket],[],0)
        if self.commSocket in read:
            try:
                data, addr = self.commSocket.recvfrom(1024)
                self.processInput(data, addr)
            except:
                pass

        if time() - checkTimer > 20: # every 20 seconds
            checkTimer = time()
            if self.commSocket in write:
                for rtc in self.rtcList:
                    try:
                        addr = (rtc, 7) # port 7 is the echo port
                        self.commSocket.sendto('ping',addr)
                        if not self.rtcCheckins[rtc][0]: # if last check was a failure
                            self.rtcCheckins[rtc][1] += 1 # incr failure count
                        self.rtcCheckins[rtc][0] = False # setting last check to failure
                    except:
                        pass

        for rtc in self.rtcList:
            if self.rtcCheckins[rtc][1] > 2: # didn't answer for a whole minute
                self.rtcCheckins[rtc][1] = 0
                self.sendError(rtc)

2 Answers

You don't mention it, so I have to remind you that since you are using select() that socket better be non-blocking. Otherwise your recvfrom() can block. Should not really happen when dealt with properly, but hard to tell from the short code snippet.

Then you don't have to check UDP socket for writability - it is always writable.
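A minimal sketch of those two points, assuming Python 3 rather than the Python 2 of the question. The helper names `make_udp_socket` and `wait_readable` are made up for illustration. The socket is switched to non-blocking mode, and select() is asked only about readability, with a timeout so the loop does not spin at full speed:

```python
import select
import socket


def make_udp_socket(bind_addr=("127.0.0.1", 0)):
    """Create a non-blocking UDP socket (hypothetical helper, not from the question)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(bind_addr)
    # In non-blocking mode recvfrom() raises BlockingIOError instead of waiting.
    sock.setblocking(False)
    return sock


def wait_readable(sock, timeout=1.0):
    """Return True if at least one datagram is queued on the socket.

    Only the read set is passed to select(); the write set is left empty
    because there is no need to poll a UDP socket for writability here.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    return sock in readable
```

In the question's loop this would replace the `select.select(..., 0)` call; the timeout value is a guess and should stay well under the 20 second ping interval.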

Now for the real problem - you are saying that packets are entering the system, but your code does not receive them. This is most probably due to overflow of the socket receive buffer. Did the number of ping targets increase over time? You are setting yourself up for a ping-response storm, and probably not reading those responses fast enough, so they pile up in the receive buffer and eventually get dropped.

My suggestions in order of ROI:

  • Spread out ping requests, don't set yourself up for a DDOS. Query, say, one system per iteration and keep last check time per target. This will allow you to even out the number of packets out and in.
  • Increase SO_RCVBUF to a large value. This will allow your network stack to better deal with packet bursts.
  • Read packets in a loop, i.e. once your UDP socket is readable (assuming it's non-blocking), read until you get EWOULDBLOCK. This would save you a bunch of select() calls.
  • See if you can use some advanced Windows API along the lines of Linux recvmmsg(2), if such a thing exists, to dequeue multiple packets per syscall.
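The first three suggestions could look roughly like the sketch below. It assumes Python 3 and a socket already set to non-blocking mode. The function names (`enlarge_receive_buffer`, `drain`, `due_targets`) and the 1 MiB buffer size are illustrative choices, not anything from the question's code. In Python 3, EWOULDBLOCK surfaces as `BlockingIOError`.

```python
import socket


def enlarge_receive_buffer(sock, size=1 << 20):
    """Request a larger receive buffer and return the size the OS reports back.

    The OS may clamp or adjust the requested value.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def drain(sock, handler, bufsize=1024):
    """Read every queued datagram from a non-blocking UDP socket.

    Keeps calling recvfrom() until it raises BlockingIOError
    (EWOULDBLOCK/EAGAIN), i.e. until the receive buffer is empty.
    Returns the number of datagrams passed to handler.
    """
    count = 0
    while True:
        try:
            data, addr = sock.recvfrom(bufsize)
        except BlockingIOError:
            break  # nothing left to read right now
        handler(data, addr)
        count += 1
    return count


def due_targets(last_ping, now, interval=20.0, limit=1):
    """Pick at most `limit` targets whose last ping is at least `interval` old.

    last_ping maps each target to the time it was last pinged. Returning
    only a few targets per loop iteration spreads the pings, and therefore
    the replies, over time instead of sending them all in one burst.
    """
    due = [t for t, ts in last_ping.items() if now - ts >= interval]
    due.sort(key=lambda t: last_ping[t])  # oldest first
    return due[:limit]
```

In the question's loop, `drain(self.commSocket, self.processInput)` would stand in for the single `recvfrom()` call whenever select() reports the socket readable, and `due_targets` would decide which devices to ping on a given iteration.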

Hope this helps.

Nikolai Fetissov
  • Actually the socket was in blocking mode, but I've had some logging in place that confirmed I never had that problem. As for the possible DDOS, this problem happens in systems with 4 devices as well as 20 (which is the largest deployed system we have), so I don't truly believe it to be a matter of a DOS. I will take your suggestions into the code and come back with results. Thanks! – flowInTheDark Jul 19 '12 at 07:01
  • Making the buffer larger didn't help, oddly enough. What did help in the end was your suggestion to read the socket in a loop until EWOULDBLOCK every time I get it readable. Now it's working as it should. Thank you! – flowInTheDark Aug 07 '12 at 13:06

UDP does not guarantee reliable transmission. This could work now, in the next hour, and next year. Then in two years it could fail to communicate for a whole hour.

The route the packets take may be blocked in some situations. When that happens with TCP, the sender detects the loss and retransmits the data. Because UDP is a "send-and-forget" protocol, you can expect to lose some of your packets.

tl;dr Use TCP.

iTayb
  • Please do notice the part of the text where I mention my Wireshark sniffing and confirming that packets actually entered the system. – flowInTheDark Jul 18 '12 at 08:01