3

I have a Flink Streaming job, it failed and I got the log as below.Can anyone tell me how to solve the problem? It sometimes failed after one day running, and sometimes failed after a few hours.

09:30:25 948  INFO (org.apache.flink.runtime.executiongraph.ExecutionGraph:1240) - TriggerWindow(TumblingProcessingTimeWindows(600000), ListStateDescriptor{serializer=org.apache.flink.api.common.typeutils.base.ListSerializer@ece0f926}, ProcessingTimeTrigger(), WindowedStream.process(WindowedStream.scala:563)) -> Filter -> Filter -> Map (40/48) (19ea993ced2b161422c345c9b633853a) switched from RUNNING to FAILED.
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Lost connection to task manager . This indicates that the remote task manager was lost.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:146)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
    at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
    at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
    at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
    at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    ... 6 more
Daisy
  • 61
  • 1
  • 4
  • Possible duplicate of [When is "java.io.IOException:Connection reset by peer" thrown?](https://stackoverflow.com/questions/8658118/when-is-java-io-ioexceptionconnection-reset-by-peer-thrown) – mrkernelpanic Aug 22 '18 at 06:36
  • It cannot solve my problem. As I know, for taskManager, a NettyConnectionManager manages both a Netty client and a Netty server instance. I do not know what might cause this exception, because there are no more error logs. – Daisy Aug 22 '18 at 07:11
  • You should look for the root cause on the other side of the connection. Try searching for the id of operator that fails. Something wrong is happening with the TaskManager that executes this Task. There might be many different problems why the connection is dropped. – Dawid Wysakowicz Aug 22 '18 at 07:17
  • It is the failed chain Window -> Filter -> Filter -> Map that caused the problem? How can I get the exact failed operator? The dashboard can not use after the job failed, and there are no more error logs.@DawidWysakowicz – Daisy Aug 22 '18 at 08:12
  • You can grep for the hashcode in brackets. What is your setup? Do you run multiple TaskManagers? I suppose so cause you have remote connections. You should look in log files of all your TMs – Dawid Wysakowicz Aug 22 '18 at 15:35
  • Have you ended up solving this issue? – Luca Cappelletti Jan 26 '19 at 20:56

1 Answers1

1

I ended up finding the root cause in job manager log:

- Closing TaskExecutor connection container_e06_1554425226316_0158_01_000024 because: Container [pid=14446,containerID=container_e06_1554425226316_0158_01_000024] is running beyond physical memory limits. Current usage: 12.5 GB of 12.5 GB physical memory used; 14.7 GB of 26.2 GB virtual memory used. Killing container.

so I increased TM memory

Yue Liu
  • 131
  • 1
  • 8