I'm new to Spark, and I'm using the KMeans algorithm to cluster a data set that is 484 MB in size with 213,104 dimensions. My code is as follows:
import java.io.File
import java.nio.file.{Files, Paths}
import org.apache.commons.io.FileUtils
import org.apache.spark.mllib.clustering.KMeans

val k = args(0).toInt
val maxIter = args(1).toInt

// Train a k-means model with k clusters and at most maxIter iterations.
val model = new KMeans().setK(k).setMaxIterations(maxIter).setEpsilon(1e-1).run(trainingData)

// Save the cluster centers as text, replacing any previous output directory.
val modelRDD = sc.makeRDD(model.clusterCenters)
val saveModelPath = "/home/work/kMeansModel_" + args(0)
if (Files.exists(Paths.get(saveModelPath))) {
  FileUtils.deleteDirectory(new File(saveModelPath))
}
modelRDD.saveAsTextFile(saveModelPath)

val loss = model.computeCost(trainingData)
println("Within Set Sum of Squared Errors = " + loss)
When I set k = 150 it works, but when I set k = 300 it throws a java.lang.OutOfMemoryError: Java heap space exception. My configuration is:
--executor-memory 30G --driver-memory 4G --conf spark.shuffle.spill=false --conf spark.storage.memoryFraction=0.1
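For a rough sense of scale, here is my own back-of-the-envelope on how the cluster centers grow with k (assuming they end up as dense double-precision vectors; this is my estimate, not something taken from Spark internals):

// Rough back-of-the-envelope: once averaged, cluster centers are effectively
// dense vectors of doubles, so one set of centers costs about k * dims * 8 bytes.
val dims = 213104
def centerSetBytes(k: Int): Long = k.toLong * dims * 8L

println(centerSetBytes(150) / (1024 * 1024)) // ~244 MB
println(centerSetBytes(300) / (1024 * 1024)) // ~487 MB

If I understand MLlib correctly, these centers are held on the driver and also shipped to the executors each iteration, so doubling k roughly doubles that cost; with --driver-memory 4G that may be what pushes k = 300 over the edge, but I'm not sure.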