We have a Spark Streaming program that reads input from Kafka using createDirectStream and builds a composite object per key using mapWithState.
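For context, a minimal sketch of how the stream is set up, assuming the spark-streaming-kafka-0-10 integration (the application name, broker, topic, group id, and the InputData deserializer are illustrative, not our actual configuration):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import scala.Tuple2;

// 3-second batch interval, as noted below
SparkConf conf = new SparkConf().setAppName("app1");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(3));

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "broker1:9092");               // illustrative
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", InputDataDeserializer.class); // hypothetical custom deserializer
kafkaParams.put("group.id", "app1-group");                          // illustrative

// Direct (receiver-less) stream from Kafka
JavaInputDStream<ConsumerRecord<String, InputData>> directStream =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, InputData>Subscribe(
            Arrays.asList("input-topic"), kafkaParams));            // illustrative topic

// Keyed pairs consumed by mapWithState below
JavaPairDStream<String, InputData> inputMessages =
    directStream.mapToPair(record -> new Tuple2<>(record.key(), record.value()));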

JavaMapWithStateDStream<String, InputData, Trip, Tuple2<String, CompositeData>> mappedDStream =
    inputMessages.mapWithState(
        StateSpec.function(mappingFunc).timeout(Durations.minutes(timeOutMinutes)));
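The mapping function has the standard Function3 shape; a minimal sketch, assuming hypothetical update and toCompositeData methods on Trip:

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.State;
import scala.Tuple2;

// Merges each incoming record into the per-key Trip state and emits the
// current composite view for that key.
Function3<String, Optional<InputData>, State<Trip>, Tuple2<String, CompositeData>> mappingFunc =
    (key, value, state) -> {
        Trip trip = state.exists() ? state.get() : new Trip();
        if (value.isPresent()) {
            trip.update(value.get());            // hypothetical merge method
        }
        if (!state.isTimingOut()) {              // updating a timing-out key would throw
            state.update(trip);
        }
        return new Tuple2<>(key, trip.toCompositeData()); // hypothetical accessor
    };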

We are running this code on a three-machine Hadoop YARN cluster with an HDFS checkpoint directory specified. The Hadoop version is 2.7.0 and Spark is 2.0.
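The checkpoint directory is set on the context before the stream is started; a sketch, with the base path inferred from the exception below:

// mapWithState requires checkpointing; the UUID and rdd-NNN directories in the
// exception below are created by Spark underneath this base path.
// jssc is the JavaStreamingContext from the setup sketch above.
jssc.checkpoint("hdfs:///.streamingcheckpoint/app1");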

The streaming batch interval is 3 seconds. The program runs continuously for 48 to 72 hours and then fails with the exception below.

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /.streamingcheckpoint/app1/2b86771a-0771-4f5a-a8cf-878f79a29d03/rdd-167/.part-00024-attempt-3 could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)

We have referred to the answers below; however, in our case there is enough space available on the cluster (disk utilization is below 30%), and even after this failure the NameNode is still active and we are able to add files to HDFS with the hdfs commands. We have even increased the number of threads available to the NameNode.

Ref: could only be replicated to 0 nodes instead of minReplication (=1). There are 4 datanode(s) running and no node(s) are excluded in this operation

Hadoop: ...be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

We are also seeing the message below written continuously in our logs, right from the start.

[rdd_11_28] (org.apache.spark.executor.Executor) [2017-03-16 11:35:00,690] WARN 1 block locks were not released by TID = 202:

What could be the cause of this job failure?
