
I am using netty.io (4.0.4) in a Java application to implement a TCP client that communicates with an external hardware driver. One requirement of this hardware is that the client sends a KEEP_ALIVE (heart-beat) message every 30 seconds; the hardware, however, does not respond to this heart-beat. My problem is that when the connection is abruptly broken (e.g. the network cable is unplugged), the client is completely unaware of it and keeps sending the KEEP_ALIVE message for much longer (around 5-10 minutes) before it gets an operation timeout exception. In other words, from the client side there is no way to tell whether it is still connected.

Below is a snippet of my bootstrap setup, in case it helps:

// bootstrap setup
bootstrap = new Bootstrap().group(group)
            .channel(NioSocketChannel.class)
            .option(ChannelOption.SO_KEEPALIVE, true)
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 3000)
            .remoteAddress(ip, port)
            .handler(tcpChannelInitializer);


// part of the pipeline responsible for keep alive messages
pipeline.addLast("idleStateHandler", new IdleStateHandler(0, 0, 30, TimeUnit.SECONDS));
pipeline.addLast("keepAliveHandler", keepAliveMessageHandler);

I would expect that, since the client keeps sending keep-alive messages that never reach the other end, the missing acknowledgements would reveal the broken connection much earlier. Is that not the case?

EDIT

Code from the KeepAliveMessageHandler

public class KeepAliveMessageHandler extends ChannelDuplexHandler
{

    private static final Logger LOGGER = getLogger(KeepAliveMessageHandler.class);

    private static final String KEEP_ALIVE_MESSAGE = "";


    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception
    {
        if (!(evt instanceof IdleStateEvent)) {
            // Pass non-idle events on to the rest of the pipeline
            super.userEventTriggered(ctx, evt);
            return;
        }

        IdleStateEvent e = (IdleStateEvent) evt;
        Channel channel = ctx.channel();

        if (e.state() == IdleState.ALL_IDLE) {
            LOGGER.info("Sending KEEP_ALIVE_MESSAGE");
            channel.writeAndFlush(KEEP_ALIVE_MESSAGE);
        }
    }
}

EDIT 2

I tried to explicitly check that the keep-alive message was delivered, using the code below:

    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception
    {
        if (!(evt instanceof IdleStateEvent)) {
            // Pass non-idle events on to the rest of the pipeline
            super.userEventTriggered(ctx, evt);
            return;
        }

        IdleStateEvent e = (IdleStateEvent) evt;
        Channel channel = ctx.channel();

        if (e.state() == IdleState.ALL_IDLE) {
            LOGGER.info("Sending KEEP_ALIVE_MESSAGE");
            channel.writeAndFlush(KEEP_ALIVE_MESSAGE).addListener(future -> {

                if (!future.isSuccess()) {
                    LOGGER.error("KEEP_ALIVE message write error");
                    channel.close();
                }
            });
        }
    }

This does not work either. :( According to this answer, this behavior makes sense, but I am still hoping there is some way to figure out whether the write was a "real" success. (Having the hardware ack the heart-beat is not possible.)

codeCruncher
  • Maybe take a look at the answer over here? https://stackoverflow.com/questions/21358800/tcp-keep-alive-to-determine-if-client-disconnected-in-netty – Cᴏʀʏ Oct 06 '17 at 21:27
  • thanks for that link, I looked at that before I asked the question. The issues I have with that solution are: a. since the network cable is unplugged, no normal closing of the channel is possible; b. implementing the ReadTimeoutHandler won't work, because the hardware does not say much, so this would be triggered way too often :/ (the ack I am talking about in the question is the TCP-layer ack, not an application-level one). Make sense? Maybe what I want is not even possible with TCP, and that's part of the question. – codeCruncher Oct 06 '17 at 21:31
  • I would expect you to get a 'connection reset' or 'software caused connection abort' after a couple of minutes. Are you sure you're detecting send errors correctly when you send the heartbeats? – user207421 Oct 07 '17 at 03:33
  • @EJP maybe I am not detecting errors correctly, all I am doing is sending the heart-beat like so.. IdleStateEvent e = (IdleStateEvent) evt; Channel channel = ctx.channel(); if (e.state() == IdleState.ALL_IDLE) { LOGGER.info("Sending KEEP_ALIVE_MESSAGE"); channel.writeAndFlush(KEEP_ALIVE_MESSAGE); } – codeCruncher Oct 07 '17 at 13:46

1 Answer


You have enabled TCP keepalive:

.option(ChannelOption.SO_KEEPALIVE, true)

But I can't see anything in your code that ensures a keepalive is sent every 30 seconds.

If a connection has been terminated due to a TCP Keepalive time-out and the other host eventually sends a packet for the old connection, the host that terminated the connection will send a packet with the RST flag set to signal the other host that the old connection is no longer active. This will force the other host to terminate its end of the connection so a new connection can be established.

Typically TCP Keepalives are sent every 45 or 60 seconds on an idle TCP connection, and the connection is dropped after 3 sequential ACKs are missed. This varies by host, e.g. by default Windows PCs send the first TCP Keepalive packet after 7200000 ms (2 hours), then send 5 Keepalives at 1000 ms intervals, dropping the connection if there is no response to any of the Keepalive packets.

(taken from http://ltxfaq.custhelp.com/app/answers/detail/a_id/1512/~/tcp-keepalives-explained)
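To put numbers on the figures quoted above, here is a small sketch of the arithmetic. It assumes the simple model "idle time before the first probe, plus one interval per unanswered probe"; the class and method names are made up for illustration and are not part of any library.

```java
// Hypothetical helper: estimates how long an idle, dead connection can go
// unnoticed when only TCP keepalive probes are used to detect it.
final class KeepaliveDetection {

    // idleMillis: quiet time before the first probe is sent
    // probes: number of unanswered probes before the connection is dropped
    // probeIntervalMillis: spacing between consecutive probes
    static long worstCaseMillis(long idleMillis, int probes, long probeIntervalMillis) {
        return idleMillis + (long) probes * probeIntervalMillis;
    }

    public static void main(String[] args) {
        // Windows defaults as quoted above: 7200000 ms idle, 5 probes, 1000 ms apart.
        long windowsDefault = worstCaseMillis(7_200_000L, 5, 1_000L);
        System.out.println(windowsDefault + " ms"); // 7205000 ms, i.e. about 2 hours
    }
}
```

With defaults like these, keepalive alone notices a dead peer only after roughly two hours of silence, which is why the default settings are of little use for a 30-second heart-beat requirement.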

I do understand now that

pipeline.addLast("idleStateHandler", new IdleStateHandler(0, 0, 30, TimeUnit.SECONDS));
pipeline.addLast("keepAliveHandler", keepAliveMessageHandler);

will trigger an idle event after every 30 seconds of mutual inactivity, and keepAliveMessageHandler will then send a packet to the remote side.

Unfortunately

ChannelFuture future = channel.writeAndFlush(KEEP_ALIVE_MESSAGE);

is considered a success as soon as the data is written to the OS buffers, not when it is received at the other end.
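This is not specific to Netty. The plain-JDK sketch below (no Netty involved; the class name and the "KEEP_ALIVE" payload are placeholders I made up) connects to a local listener that never accepts or reads, and the write still completes normally. It does not simulate an unplugged cable; it only illustrates that a completed write says nothing about whether the peer application ever saw the data.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question or of Netty.
final class LocalWriteDemo {

    // Returns true if write() and flush() completed even though the
    // listening side never called accept() or read anything.
    static boolean writeWithoutReader() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port; the kernel completes the TCP
        // handshake on behalf of the listener via its backlog queue.
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort())) {
            OutputStream out = client.getOutputStream();
            out.write("KEEP_ALIVE".getBytes(StandardCharsets.US_ASCII)); // placeholder payload
            out.flush();
            return true; // the bytes were handed to the local OS, nothing more
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("write completed without a reader: " + writeWithoutReader());
    }
}
```

The listener in writeAndFlush(...).addListener(...) is in the same position: it reports that the local hand-off worked, so it cannot tell a healthy link from a dead one until the OS itself gives up on the connection.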

It seems that under your conditions you have only 2 options:

  1. Sending a command that will get some response from the external device (something that will not cause disruption)
    But I would assume that this is impossible in your case.

  2. Modifying the underlying TCP driver settings
    The default OS settings for TCP keepalive are geared toward conserving system resources so that a large number of applications and connections can be supported. Provided that you have a dedicated system, you may configure more aggressive TCP checks. Here is a link on how to adjust the Linux kernel settings: http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
    The solution should work on plain installations as well as in VMs and Docker containers.
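If changing the kernel-wide defaults is not an option, the same three values can sometimes be set per socket instead. The sketch below is only an illustration under several assumptions: it uses the plain JDK (jdk.net.ExtendedSocketOptions, available from Java 11) rather than Netty 4.0.4, the platform must actually support these options, and the class name and the 30 s / 10 s / 3 probes values are made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;
import java.util.List;
import jdk.net.ExtendedSocketOptions;

// Hypothetical helper, not part of the original answer or of Netty.
final class KeepaliveTuning {

    // Enables SO_KEEPALIVE and, where the platform supports it, tightens the
    // keepalive timing for this one socket. Returns true if the timing was applied.
    static boolean apply(SocketChannel ch, int idleSec, int intervalSec, int probes)
            throws IOException {
        ch.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        boolean supported = ch.supportedOptions().containsAll(List.of(
                ExtendedSocketOptions.TCP_KEEPIDLE,
                ExtendedSocketOptions.TCP_KEEPINTERVAL,
                ExtendedSocketOptions.TCP_KEEPCOUNT));
        if (!supported) {
            return false; // fall back to the OS-wide defaults
        }
        ch.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSec);         // idle time before first probe
        ch.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSec); // gap between probes
        ch.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probes);         // unanswered probes before drop
        return true;
    }

    // Opens an unconnected channel, applies illustrative values and reports
    // whether SO_KEEPALIVE is enabled afterwards.
    static boolean keepAliveEnabledAfterApply() {
        try (SocketChannel ch = SocketChannel.open()) {
            apply(ch, 30, 10, 3);
            return ch.getOption(StandardSocketOptions.SO_KEEPALIVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether the Netty version in use exposes equivalent per-channel options depends on the transport, so treat this as a starting point for experiments rather than a drop-in fix.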

General information on the topic: https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html

Aleh Maksimovich
  • I did take a look at the SO post, and like I said in the follow-up comment, the solution won't work (I tried it, it throws an exception if there is no read, that does not necessarily mean the connection is dead, which is not what I want). Also I have added code for my KeepAliveHandler to the original question. I do appreciate your effort to help – codeCruncher Oct 07 '17 at 13:52
  • Now it's a very different story. I have an update for you. If it is not of help, please add information about your send timeout, retry count and what exactly your KEEP_ALIVE_MESSAGE is. – Aleh Maksimovich Oct 07 '17 at 14:55
  • So I did try out handling the ChannelFuture that is returned after the writeAndFlush() call, like this: channel.writeAndFlush(KEEP_ALIVE_MESSAGE).addListener(future -> { if (!future.isSuccess()) { LOGGER.error("KEEP_ALIVE message write error"); channel.close(); } }); but this does not work, the if block is not executed! I read that netty reports success when the data was written to the IO buffer, not when it's received at the other end. – codeCruncher Oct 07 '17 at 15:28
  • You are right, the operation is considered a success when it is written to OS buffers. Unfortunately I don't see an option for you other than 1) Modifying the underlying OS keepalive settings 2) Sending a command that will get some response from the external device (something that will not cause disruption) – Aleh Maksimovich Oct 07 '17 at 16:28
  • Last update from me. I have posted some info on reconfiguring linux kernel TCP settings if you ever consider this option. – Aleh Maksimovich Oct 07 '17 at 19:45