I am implementing an algorithm and testing it on medium-sized data in Spark (the Scala interface) on a local node. I am starting with very simple processing, and I'm already getting java.lang.OutOfMemoryError: Java heap space, even though I'm pretty sure the data isn't big enough for such an error to be reasonable. Here is the minimal breaking code:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}

// Local mode with 4 worker threads and 1 GB of executor memory
val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("AdultProcessing")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

val dataFile = "data/census/processed/census-income.data"
// Read the text file into (at least) 4 partitions
val censusData: RDD[String] = sc.textFile(dataFile, 4)

// Split every row on ':' and ',' into an array of fields
val censusDataPreprocessed = censusData.map { row =>
  val separators: Array[Char] = ":,".toCharArray
  row.split(separators)
}

// Pull all results back to the driver
val res = censusDataPreprocessed.collect()
The data I'm using is the classic census data, uncompressed. It's 100 MB and almost 200k rows. The amount of memory on my machine should be more than sufficient:
nietaki@xebab$ free -tm
             total       used       free     shared    buffers     cached
Mem:         15495      12565       2929          0        645       5608
-/+ buffers/cache:       6311       9183
Swap:         3858          0       3858
Total:       19354      12566       6788
The chunk of the data file handled by each of the four virtual nodes is under 30 MB, and the only processing I'm performing is splitting the row strings into arrays of fewer than 50 items. I can't believe this operation alone should use up the memory.
While trying to debug the situation I have found that reducing the number of nodes to 1 (local[1]), or alternatively increasing the minPartitions argument of SparkContext.textFile() from 4 to, say, 8, cures the situation, but it doesn't make me any wiser.
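To be concrete, either of the following variations of the snippet above runs without the error (the new value names are only for illustration; everything else mirrors the code above):

// Variation 1: a single worker thread instead of four (no OOM)
val confSingleThread = new SparkConf()
  .setMaster("local[1]")
  .setAppName("AdultProcessing")
  .set("spark.executor.memory", "1g")

// Variation 2: keep local[4], but request more partitions from textFile (also no OOM)
val censusDataMoreSplits: RDD[String] = sc.textFile(dataFile, 8)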
I'm using Spark 1.0.0 and Scala 2.10.4. I am launching the project directly from sbt: sbt run -Xmx2g -Xms2g.
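In case the way the heap settings reach the JVM matters: an alternative I could try is pinning the heap size in build.sbt rather than on the command line, assuming sbt forks a separate JVM for run (this is only a sketch of what I mean, not something I have verified helps):

// build.sbt (sketch only, assuming the run task is forked)
fork in run := true                              // run in a separate JVM so javaOptions take effect
javaOptions in run ++= Seq("-Xmx2g", "-Xms2g")   // heap settings for the forked JVM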