
I keep getting the following stack trace in my logs very often:

WARN TransportChannelHandler: Exception in connection from /172.31.3.245:46014
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:898)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)

and eventually I get a "No space left on device" error. After researching, I found that I could set spark.local.dir via .set("spark.local.dir", "/home/ubuntu/sparktempdata") (roughly as in the sketch below), which has reduced the frequency of the "No space left on device" errors in my traces. However, one more such error remains and I am not sure how to fix it:
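
A simplified sketch of how I apply that setting (the class name, app name, and surrounding boilerplate are placeholders rather than my exact code, and the master URL is assumed to be supplied via spark-submit):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LocalDirExample {
        public static void main(String[] args) {
            // Point Spark's scratch space at a directory on a volume that has
            // free space, instead of the default /tmp on the nearly full root disk.
            SparkConf conf = new SparkConf()
                    .setAppName("analytics") // placeholder name
                    .set("spark.local.dir", "/home/ubuntu/sparktempdata");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... the actual job goes here ...
            sc.stop();
        }
    }

The error that remains: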

16/09/06 08:34:18 ERROR FileAppender: Error writing stream to file /usr/local/spark/work/app-20160906083355-0000/1/stderr
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:326)
    at org.apache.spark.util.logging.FileAppender.appendToFile(FileAppender.scala:92)
    at org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply$mcV$sp(FileAppender.scala:75)
    at org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply(FileAppender.scala:62)
    at org.apache.spark.util.logging.FileAppender$$anonfun$appendStreamToFile$1.apply(FileAppender.scala:62)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
    at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:78)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
    at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)

When I open the file /usr/local/spark/work/app-20160906083355-0000/1/stderr, I see the following:

INFO Utils: Fetching spark://172.31.11.187:58519/jars/analytics-1.0-SNAPSHOT.jar to /tmp/spark-69b1866b-f302-4ab8-a25f-f2a8cc1f4b4f/executor-99c9eeb0-d45c-4619-8054-7f6d3f15803c/spark-c28a16b5-5ac5-440b-9e4d-7ed1b1b8bcbe/fetchFileTemp6564441043886275791.tmp
16/09/06 08:34:18 WARN TransportChannelHandler: Exception in connection from /172.31.11.187:58519
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:51)
        at sun.nio.ch.SinkChannelImpl.write(SinkChannelImpl.java:167)
        at org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadCallback.onData(NettyRpcEnv.scala:395)
        at org.apache.spark.network.client.StreamInterceptor.handle(StreamInterceptor.java:69)
        at org.apache.spark.network.util.TransportFrameDecoder.feedInterceptor(TransportFrameDecoder.java:202)
        at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:70)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.pro

Here is the df -h output on my worker node. Also, all my machines have the same amount of resources.

Filesystem      Size  Used Avail Use% Mounted on
udev            7.4G   12K  7.4G   1% /dev
tmpfs           1.5G  344K  1.5G   1% /run
/dev/xvda1      7.8G  7.3G   92M  99% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none            7.4G     0  7.4G   0% /run/shm
none            100M     0  100M   0% /run/user
/dev/xvdb        37G   49M   35G   1% /mnt
  • And did you verify whether /usr/local/spark/work/app-20160906083355-0000/1/stderr has indeed become enormous, or whether your disk space is really running low at the time of the error? Perhaps post your df -h output. – chrisvp Sep 06 '16 at 08:50
  • Yes, I am pasting everything that I see in the logs. Also, I have added a lot more details, so please reread the question; I would be happy to answer if you have more questions. – user1870400 Sep 06 '16 at 09:02
  • Well, I guess you do run out of disk space, as the error says. It is up to you to see what is stored in the log file. Perhaps you log too much, or perhaps an error occurs which keeps transmitting log messages. – chrisvp Sep 06 '16 at 09:12
  • @chrisvp My application doesn't log anything (and, as a matter of fact, it is very simple code with no loggers or anything). I just pasted what is in that log file. – user1870400 Sep 06 '16 at 09:17
  • Is home on /mnt? It looks like you have just moved from one directory on the root mount point to another. The root mount point is full in your df. Then you may want to turn on rolling logs on the executors if that continues to be an issue. – RussS Sep 06 '16 at 21:58
  • @RussS home is under /home, and I tried setting .set("spark.executor.logs.rolling.maxSize", "10000") (see the sketch after these comments); still the issue persisted, but I understand the essence of what you are saying, so I deployed on larger machines with around 30G and the issue was gone; my driver is doing what it is expected to do. In short, you are awesome! – user1870400 Sep 06 '16 at 23:12
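
A minimal sketch of the rolling-executor-log configuration discussed in the comments above (my own guess at a working combination, not verified on this cluster; note that spark.executor.logs.rolling.maxSize is a size in bytes and only takes effect when the rolling strategy is set to "size"):

    import org.apache.spark.SparkConf;

    public class RollingLogsSketch {
        public static SparkConf buildConf() {
            return new SparkConf()
                    .setAppName("analytics") // placeholder name
                    // roll an executor's stderr/stdout once it reaches roughly 10 MB
                    .set("spark.executor.logs.rolling.strategy", "size")
                    .set("spark.executor.logs.rolling.maxSize", String.valueOf(10 * 1024 * 1024))
                    // keep only the 5 most recent rolled log files per executor
                    .set("spark.executor.logs.rolling.maxRetainedFiles", "5");
        }
    }

Even with rolling enabled, the df output above shows / at 99% while /mnt has plenty of free space, so moving Spark's work and scratch directories onto the larger volume, or simply using bigger disks as was eventually done, is what actually resolves the problem.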

1 Answer


I think this error is due to a broken pipe: basically, the client (say, your laptop) did not hear anything from the server for a long time, so it assumes it is no longer connected. Handle SIGPIPE and set the timeout to 2 minutes. The link below will help you.

broken pipe link
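
If the suggestion is to give the connection a longer idle timeout, one possible reading in Spark terms (my interpretation, not something the answer states) is to raise spark.network.timeout, which defaults to 120s, i.e. the two minutes mentioned above:

    import org.apache.spark.SparkConf;

    public class NetworkTimeoutSketch {
        public static SparkConf buildConf() {
            // Assumption: the "2 minutes" above corresponds to Spark's network
            // timeout, which already defaults to 120s; raising it gives a slow or
            // busy peer more time before the connection is treated as dead.
            return new SparkConf()
                    .setAppName("analytics") // placeholder name
                    .set("spark.network.timeout", "240s");
        }
    }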

– braj