
I am working on implementing an algorithm and testing it on medium-sized data in Spark (the Scala interface) on a local node. I am starting with very simple processing and I'm getting java.lang.OutOfMemoryError: Java heap space even though I'm pretty sure the data isn't big enough for such an error to be reasonable. Here is the minimal breaking code:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("AdultProcessing")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

val dataFile = "data/census/processed/census-income.data"
val censusData: RDD[String] = sc.textFile(dataFile, 4)
val censusDataPreprocessed = censusData.map { row =>
  val separators: Array[Char] = ":,".toCharArray
  row.split(separators)
}

val res = censusDataPreprocessed.collect()

The data I'm using is the classic census data, uncompressed: it's about 100MB and almost 200k rows. The amount of memory on my machine should be more than sufficient:

nietaki@xebab$ free -tm
             total       used       free     shared    buffers     cached
Mem:         15495      12565       2929          0        645       5608
-/+ buffers/cache:       6311       9183
Swap:         3858          0       3858
Total:       19354      12566       6788

The chunks of the data file are under 30MB for each of the virtual nodes and the only processing I'm performing is splitting row strings into arrays of under 50 items. I can't believe this operation alone should use up the memory.

While trying to debug the situation I found that reducing the number of nodes to 1, or, alternatively, increasing SparkContext.textFile()'s minPartitions argument from 4 to, for example, 8 cures the situation, but it doesn't make me any wiser.
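For reference, the two tweaks I mean look roughly like this (a sketch of the idea, not the exact code I ran):

// more partitions at read time (8 instead of 4)...
val censusDataMoreParts: RDD[String] = sc.textFile(dataFile, 8)

// ...or repartitioning the existing RDD before the map
val censusDataRepartitioned: RDD[String] = censusData.repartition(8)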

I'm using Spark 1.0.0 and Scala 2.10.4. I am launching the project directly from sbt: sbt run -Xmx2g -Xms2g.
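In case it matters, another way I could pin the heap would be through the build definition; a build.sbt sketch (hypothetical, not my actual build file, and assuming a forked run):

// fork the run so the javaOptions below apply to the JVM executing the job
fork in run := true
javaOptions in run ++= Seq("-Xmx2g", "-Xms2g")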

nietaki
  • I haven't used Spark, but your experience seems to mirror my lab's. Small amounts of data explode in Spark. You're not crazy. Good luck. – Isaac Jun 24 '14 at 19:30
  • My experience seems to be consistent with the advice from this answer (http://stackoverflow.com/a/22742982/246337): increasing the number of partitions to 2x the number of nodes seems to work. – nietaki Jun 24 '14 at 19:34
  • 2
  • 100MB on disk != 100MB in memory – om-nom-nom Jun 24 '14 at 19:36

1 Answer


The JVM is memory hungry. Spark runs on the JVM.

I'd recommend inspecting the heap with a profiler to find out how much memory your records actually use. In my case they were 2x their size "at rest", and they were a combination of primitive types and Strings.
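If you want a quick number without firing up a profiler, a sketch along these lines can help. It assumes the censusData RDD from your snippet and uses Spark's SizeEstimator utility, which is only exposed as a public API in later Spark versions, so treat it purely as an illustration:

import org.apache.spark.util.SizeEstimator

// compare the raw line with its parsed form (illustrative only)
val rawLine: String = censusData.first()
val parsed: Array[String] = rawLine.split(Array(':', ','))
println(s"raw line:     ${SizeEstimator.estimate(rawLine)} bytes")
println(s"parsed array: ${SizeEstimator.estimate(parsed)} bytes")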

In your case, Strings are particular memory-eaters. "" (the empty string) is ~40 bytes on its own; longer strings amortize the cost of that structure. See [1].

Applying the table in the previous resource to your data:

line: String = 73, Not in universe, 0, 0, High school graduate, 0, Not in universe, Widowed, Not in universe or children, Not in universe, White, All other, Female, Not in universe, Not in universe, Not in labor force, 0, 0, 0, Nonfiler, Not in universe, Not in universe, Other Rel 18+ ever marr not in subfamily, Other relative of householder, 1700.09, ?, ?, ?, Not in universe under 1 year old, ?, 0, Not in universe, United-States, United-States, United-States, Native- Born in the United States, 0, Not in universe, 2, 0, 95, - 50000.

line.size
// Int = 523

// rough estimate of a String's JVM footprint: ~40 bytes of overhead plus padded char data
def size(s: String) = s.size / 4 * 8 + 40

line.split(",.").map(w => size(w)).sum
// Int = 2432

So, thanks to all those small strings, your data in memory is roughly 5x its size 'at rest'. Still, 200k lines of that data adds up to roughly 500MB. This might indicate that your executor is operating at the default value of 512MB. Try setting 'spark.executor.memory' to a higher value, but also consider a heap size of >8GB to comfortably work with Spark.
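A sketch of what I mean, with illustrative values only (not tuned for your dataset). Keep in mind that with a local[...] master everything runs inside the single driver JVM, so the heap you give that JVM (-Xmx) is what ultimately bounds memory:

import org.apache.spark.{SparkConf, SparkContext}

// illustrative values, not a recommendation for this exact dataset
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("AdultProcessing")
  .set("spark.executor.memory", "4g")
// the JVM launching this context also needs a large enough -Xmx,
// since in local mode the executor lives inside the same JVM
val sc = new SparkContext(conf)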

maasg
  • I had 'spark.executor.memory' set to 1g from the beginning and have now increased it to 2g, but it doesn't change the behaviour whatsoever. I haven't done any Java profiling before, so I might be looking at it wrong, but it [didn't look like I was running out of heap space either](http://i.imgur.com/bRzn7tQ.png) – nietaki Jun 24 '14 at 20:57
  • I'm also curious what the general approach would be in similar situations. My guess would be: using appropriate data types (which I was moving towards before being held back by this issue), operating on RDDs as long as possible before collecting, and increasing parallelism and partition counts in the settings (???)... – nietaki Jun 24 '14 at 21:20
  • Interesting. I ran into a similar situation, but it was due to actual memory limits being reached: see [gc pressure goes bum](https://drive.google.com/file/d/0BznIWnuWhoLlOEQ5UDIzWGR5RzA/edit?usp=sharing). – maasg Jun 24 '14 at 21:29