Can anyone explain how the storage level of an RDD works?

I get a heap memory error when I use the persist method with the storage level StorageLevel.MEMORY_AND_DISK_2(), but my code works fine when I use the cache method. As per the Spark documentation, cache() persists an RDD with the default storage level (MEMORY_ONLY).
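If I read the API correctly, these two calls should be identical. A minimal sketch of my understanding, assuming a JavaRDD<String> named rdd (note that Spark only lets you assign a storage level once per RDD, so these are alternatives, not a sequence):

import org.apache.spark.storage.StorageLevel;

// cache() is documented as shorthand for the default storage level:
rdd.cache();
rdd.persist(StorageLevel.MEMORY_ONLY()); // same thing, spelled out

// The level I pass explicitly keeps partitions in memory, spills to disk
// when they don't fit, and (the "_2" suffix) replicates each partition
// on two cluster nodes:
rdd.persist(StorageLevel.MEMORY_AND_DISK_2());

So I would expect MEMORY_AND_DISK_2 to be, if anything, more forgiving on memory than MEMORY_ONLY, yet it is the one that fails.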
My code, where I get the heap error:
JavaRDD<String> rawData = sparkContext
        .textFile(inputFile.getAbsolutePath())
        .setName("Input File")
        .persist(SparkToolConstant.rdd_stroage_level); // works fine when this is cache()

String[] headers = new String[0];
String headerStr = null;
if (headerPresent) {
    headerStr = rawData.first();
    headers = headerStr.split(delim);
    List<String> headersList = new ArrayList<String>();
    headersList.add(headerStr);
    JavaRDD<String> headerRDD = sparkContext.parallelize(headersList);
    JavaRDD<String> filteredRDD = rawData.subtract(headerRDD)
            .setName("Raw data without header")
            .persist(StorageLevel.MEMORY_AND_DISK_2());
    rawData = filteredRDD;
}
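For comparison, a sketch of the variant that runs without errors for me, with both persist calls swapped for cache() (same RDDs and names as above):

JavaRDD<String> rawData = sparkContext
        .textFile(inputFile.getAbsolutePath())
        .setName("Input File")
        .cache(); // default MEMORY_ONLY

// ...same header handling as above...
JavaRDD<String> filteredRDD = rawData.subtract(headerRDD)
        .setName("Raw data without header")
        .cache(); // no heap error this way
rawData = filteredRDD;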
Stack trace:
Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 10, localhost): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110)
at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1176)
at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1185)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:846)
at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:668)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:176)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
Spark version: 1.3.0