I have a strange problem as follows.

I am writing a device client in Python 2.7. There are about 1,100 active tracking devices that send signals to the server. How often they send depends on the situation, but each device must send at least one GPS position signal every hour.

The devices run in long-connection mode: a connection initiated by a device should stay alive for 3-4 hours. To keep the connection alive, they send heartbeat signals (these are not GPS position signals, but they do contain some data). The heartbeat interval is 15 minutes.

Below is my script for listening on a TCP port:

import logging
import socket
import thread  # Python 2 low-level threading module


class Server(object):
    def __init__(self, host, sock_port, buffsize=1024):
        self.hostname = host
        self.sock_port = sock_port
        self.buffsize = buffsize
        self.socket = None
        self.log = logging.getLogger(__name__)

    def start(self):
        self.log.info("Listening: ")
        self.socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.socket.bind((self.hostname, self.sock_port))
        self.socket.listen(1024)  # backlog of pending, not-yet-accepted connections

        while True:
            # One thread per accepted connection; it lives until handle_data returns.
            conn, address = self.socket.accept()
            thread.start_new_thread(GV55LiteHandler(conn=conn, buff_size=self.buffsize).handle_data, ())

This is the method that is called when the socket server receives a new connection:

class GV55LiteHandler():
    ....

    def handle_data(self):
        while True:
            try:
                _veri = self.conn.recv(self.buff_size)
                if not _veri:
                    # recv() returned an empty string: the peer closed the connection
                    raise NoIncomingDataException()
            except NoIncomingDataException:
                break
            except Exception as h_e:
                print h_e
                break
            else:
                self.control_data(_veri)
        self.conn.close()

After a while, I check the number of threads in the process (using psutil) and see more than 5,000 threads in total. My interpretation is that some devices have dead connections that still look active on the server: the device dropped the connection and established a new one. Judging by the total, each device appears to have created about 4 connections, closing each one when the long-connection time (configured in the device) expired and then opening a new one. I have been told this is normal in some situations and has no effect.

But after a while, I get reports that some devices cannot connect. When I kill the port-listening script and restart it, all the devices that could not connect start sending data again within 10 minutes.

I have done some research but could not find anything about this situation. My best guess is that once a device has established too many connections, either the server stops accepting new connections from that device or the device fails to create a new TCP connection to the server, and GPS data is not sent until the script is restarted and all connections are dropped. (I also have similar tracking devices from a different manufacturer: about 120 active devices and 1,600 running threads in total, which means each device established and failed to drop about 10 previous connections before establishing a new one.)

These tracking devices use a single data connection. That is, no device can have two active data connections and send data over both (which would be pointless anyway).
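Since a device can only use one data connection at a time, one idea I am considering is to shut down a device's previous socket whenever the same device connects again. The sketch below is only an illustration under assumptions: `ConnectionRegistry` and the `device_id` key are hypothetical names, and it assumes the device ID can be parsed from the first message. It is written to run on both Python 2.7 and Python 3.

```python
import socket
import threading


class ConnectionRegistry(object):
    """Remember the newest connection per device ID (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_device = {}

    def register(self, device_id, conn):
        """Store conn for device_id and shut down any older connection."""
        with self._lock:
            old = self._by_device.get(device_id)
            self._by_device[device_id] = conn
        if old is not None and old is not conn:
            try:
                # On Linux, shutdown() makes a recv() blocked on the old
                # socket return an empty string, so its handler thread
                # leaves its loop and closes the socket itself.
                old.shutdown(socket.SHUT_RDWR)
            except socket.error:
                pass  # the old socket was already disconnected

    def unregister(self, device_id, conn):
        """Forget conn, but only if it is still the newest one."""
        with self._lock:
            if self._by_device.get(device_id) is conn:
                del self._by_device[device_id]

    def count(self):
        with self._lock:
            return len(self._by_device)
```

A handler would call `register()` once it knows which device it is talking to, and `unregister()` just before closing its own socket.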

I tried setting a timeout on the accepted connection:

conn, address = self.socket.accept()
conn.settimeout(10800)

and handling the timeout in the data-processing code:

try:
    _veri = self.conn.recv(self.buff_size)
    if not _veri:
        # recv() returned an empty string: the peer closed the connection
        raise NoIncomingDataException()
except NoIncomingDataException:
    # No need to log anything in here...
    break
except socket_timeout:
    print "Socket Timeout"
    break

That seems to work: I no longer have devices that cannot send GPS data. On the other hand, conn.settimeout does not seem to set the timeout the way I expected. After a while, the connection times out about 30 seconds after the last signal. I expected a 3-hour timeout, but the connection is dropped after about 20 minutes; the device then sends a new heartbeat signal to open a new connection, followed by a GPS position signal. The GPS signal should be sent once every hour, but with settimeout set I receive it once every 20 minutes.
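For reference, my understanding is that `settimeout()` bounds each blocking call separately, so every `recv()` gets a fresh timeout rather than the connection getting a fixed lifetime. A minimal sketch of an idle-detecting read loop under that assumption follows; `read_until_idle` and its `reason` values are hypothetical names, not part of my real handler.

```python
import socket


def read_until_idle(conn, idle_timeout, buff_size=1024):
    """Collect data until the peer closes or stays silent for idle_timeout seconds."""
    # The timeout applies to each recv() call below, one call at a time.
    conn.settimeout(idle_timeout)
    chunks = []
    while True:
        try:
            data = conn.recv(buff_size)
        except socket.timeout:
            reason = "idle"  # nothing arrived within idle_timeout
            break
        if not data:
            reason = "closed"  # the peer closed the connection
            break
        chunks.append(data)
    conn.close()
    return chunks, reason
```

With a 15-minute heartbeat, an idle timeout of a couple of heartbeat intervals (for example 1800 seconds, an assumed value) would drop only connections that have missed heartbeats.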

I use blocking sockets (the default socket behaviour). I have not tried non-blocking sockets (and I do not know much about them).

How can I get rid of the inactive connections that prevent devices from sending data, without breaking the devices' long-connection mode?

Update: I never hit NoIncomingDataException in the handle_data method, in either the settimeout version or the no-timeout version.

Update 2: My server runs Debian GNU/Linux 6.0.10. My /etc/sysctl.conf configuration:

net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_ecn = 0

The Python lines above are the only ones that configure the socket, so the only option I set is setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1). I do not set socket.SO_KEEPALIVE anywhere in the Python script.
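Note that the `net.ipv4.tcp_keepalive_time` sysctl only affects sockets that have `SO_KEEPALIVE` enabled, so with the script as it stands no keepalive probes are sent. A minimal sketch of enabling it per connection is below; `enable_keepalive` is a hypothetical helper, the default values are assumptions, and the `TCP_KEEP*` option names are the Linux ones.

```python
import socket


def enable_keepalive(conn, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for a single connection."""
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # idle: seconds of silence before the first probe
    # interval: seconds between probes
    # count: unanswered probes before the kernel drops the connection
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:  # skip options this platform does not expose
            conn.setsockopt(socket.IPPROTO_TCP, option, value)
```

It would be called right after `accept()`, for example `enable_keepalive(conn)`. Once the probes go unanswered, a blocked `recv()` should fail with an error, which the existing `except Exception` branch would catch.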

Mp0int
  • At least if you hit the `raise NoIncomingDataException()` or any other exceptions, you seem to never call `self.conn.close()` - so make sure you're not leaking connections. Also, does `self.conn.recv` detect the missing heartbeats ? You'll need some way to detect a dead client, so you can close its connection - and it's not apparent from the code that you do that. – nos Sep 28 '15 at 08:41
  • No, I never hit that exception. – Mp0int Sep 28 '15 at 08:42
  • Those 20 minutes look curiously like the default TCP keepalive timeout on some systems (75000 seconds). Can you check if TCP keepalive is enabled on the sockets (i.e. what OS are you using?). If not, try enabling it. If it is enabled, you might try a lower value (using [`setsockopt()`](http://stackoverflow.com/questions/12248132/how-to-change-tcp-keepalive-timer-using-python-script)). I need to look this up, but I'm pretty sure that `settimeout()` only affects blocking operations, it doesn't close the socket after timeout. – dhke Sep 28 '15 at 10:05
  • @dhke thanks, question updated – Mp0int Sep 28 '15 at 11:16