
I'm attempting to work out how much memory my Spark job will require.

When I run the job I receive this exception:

15/02/12 12:01:08 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:20661+20661
15/02/12 12:01:08 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:61983+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:0+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:61983+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:41322+20661
15/02/12 12:01:09 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:20661+20661
15/02/12 12:01:11 ERROR executor.Executor: Exception in task 2.0 in stage 0.0 (TID 2)
java.lang.OutOfMemoryError: Java heap space

Many more "Input split" messages like the ones above are printed; I've truncated them here for brevity.

I'm logging the computations, and after approximately 1,000,000 calculations I receive the above exception.

The number of calculations required to finish the job is 64,000,000.

Currently I'm using 2 GB of memory, so does this mean that running this job in memory without any further code changes will require 2 GB * 64 = 128 GB, or is this much too simplistic a method of estimating the required memory?

How is each input split such as "file:/c:/data/example.txt:20661+20661" generated? These are not added to the file system, as "file:/c:/data/example.txt:20661+20661" does not exist on the local machine.

  • Before answering the question "how much memory do I need?" you should answer the question: what is memory-greedy? As for local files, they should be available to all workers under the same path (using a shared folder or replication). `example.txt:61983+20661` describes the partition/split `61983+20661` of file `example.txt` – G Quintana Feb 12 '15 at 15:42
  • @GQuintana "they should be available to all workers with the same path"; alternatively, "broadcast variables, which can be used to cache a value in memory on all nodes" from http://spark.apache.org/docs/1.2.0/programming-guide.html. What determines the value "61983+20661"? – blue-sky Feb 12 '15 at 15:57
  • Loading an RDD as a broadcast variable means loading the entire RDD on each node, whereas partitioning splits the file into chunks, with each node processing some of the chunks. The value "61983+20661" is determined by the number of partitions and the file size: there is a `minPartitions` argument in the `.textFile` method. – G Quintana Feb 12 '15 at 16:19
  • @G Quintana 61983 != the number of partitions? – blue-sky Feb 13 '15 at 11:00
  • I guess that "61983+20661" means the partition starting at line 61983 and ending at line 61983+20661 – G Quintana Feb 13 '15 at 11:21
  • @G Quintana 61983 is more than the number of lines in example.txt (the file being processed). When I increase the number of partitions, the lines are spread evenly between the partitions, so I'm not sure how Spark is computing the 61983 value? – blue-sky Feb 13 '15 at 13:51
  • Then I guess 61983 is a byte length. The value is computed in Hadoop's FileInputFormat class. – G Quintana Feb 13 '15 at 14:50
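
To make the last comments concrete: the log shows four splits of 20,661 bytes each, starting at offsets 0, 20661, 41322 and 61983, so example.txt is roughly 82,644 bytes and 61983 is a byte offset rather than a line number. Below is a minimal sketch of how those splits arise; the partition count of 4 and the file path mirror the question, everything else is illustrative only:

import org.apache.spark.{SparkConf, SparkContext}

object SplitSizeDemo extends App {

  val sc = new SparkContext(new SparkConf().setAppName("SplitSizeDemo").setMaster("local[*]"))

  // Hadoop's FileInputFormat divides the file's byte length by the requested
  // number of splits, so each "start+length" in the log is a byte range.
  // For an ~82,644 byte file and 4 splits: 82644 / 4 = 20661 bytes per split.
  val rdd = sc.textFile("file:/c:/data/example.txt", minPartitions = 4)

  println(rdd.partitions.length) // expect 4
  sc.stop()
}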

1 Answer


To estimate the amount of memory required, I've used this method:

Use http://code.google.com/p/memory-measurer/ as described in "Calculate size of Object in Java".

Once it is set up, the code below can be used to estimate the size of a Scala collection, which in turn gives an indication of the memory the Spark application will require:

// Both classes come from the memory-measurer library (package objectexplorer).
import objectexplorer.{MemoryMeasurer, ObjectGraphMeasurer}

object ObjectSizeDriver extends App {

  val toMeasure = List(1, 2, 3, 4, 5, 6)

  // Counts of objects, references and primitives reachable from the collection.
  println(ObjectGraphMeasurer.measure(toMeasure))
  // Total retained size in bytes (run with the library's -javaagent set up).
  println(MemoryMeasurer.measureBytes(toMeasure))
}
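
Measuring a small literal list only gives a rough lower bound. One option, sketched below under assumptions (the object SampleSizeDriver, the sample size of 1000 and the totalRecords figure are my own illustrations, not part of the original answer), is to measure a sample of real input records and extrapolate:

import objectexplorer.MemoryMeasurer
import scala.io.Source

object SampleSizeDriver extends App {

  // Measure a sample of real records instead of a toy list, then extrapolate.
  val sampleSize = 1000
  val sample = Source.fromFile("c:/data/example.txt").getLines().take(sampleSize).toList

  val bytesPerRecord = MemoryMeasurer.measureBytes(sample).toDouble / sample.size
  println(s"~$bytesPerRecord bytes per record in memory")

  // Replace with however many records the job actually keeps in memory at once.
  val totalRecords = 64000000L
  println(s"~${bytesPerRecord * totalRecords / (1024 * 1024 * 1024)} GB estimated")
}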