I'm new to Spark, and I'm using the KMeans algorithm to cluster a data set that is 484 MB in size with 213,104 dimensions. My code is as follows:
import java.io.File
import java.nio.file.{Files, Paths}
import org.apache.commons.io.FileUtils
import org.apache.spark.mllib.clustering.KMeans

val k = args(0).toInt
val maxIter = args(1).toInt

// Train a k-means model with k clusters and at most maxIter iterations.
val model = new KMeans().setK(k).setMaxIterations(maxIter).setEpsilon(1e-1).run(trainingData)

// Save the cluster centers as text, replacing any previous output directory.
val modelRDD = sc.makeRDD(model.clusterCenters)
val saveModelPath = "/home/work/kMeansModel_" + args(0)
if (Files.exists(Paths.get(saveModelPath))) {
  FileUtils.deleteDirectory(new File(saveModelPath))
}
modelRDD.saveAsTextFile(saveModelPath)

val loss = model.computeCost(trainingData)
println("Within Set Sum of Squared Errors = " + loss)
When I set k = 150 it works, but when I set k = 300 it throws a java.lang.OutOfMemoryError: Java heap space exception. My configuration is:
--executor-memory 30G --driver-memory 4G --conf spark.shuffle.spill=false --conf spark.storage.memoryFraction=0.1
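For a rough sense of scale, here is my own back-of-the-envelope on how the cluster centers grow with k (assuming they end up as dense double-precision vectors; this is my estimate, not something taken from Spark internals):

// Rough back-of-the-envelope: once averaged, cluster centers are effectively
// dense vectors of doubles, so one set of centers costs about k * dims * 8 bytes.
val dims = 213104
def centerSetBytes(k: Int): Long = k.toLong * dims * 8L

println(centerSetBytes(150) / (1024 * 1024)) // ~244 MB
println(centerSetBytes(300) / (1024 * 1024)) // ~487 MB

If I understand MLlib correctly, these centers are held on the driver and also shipped to the executors each iteration, so doubling k roughly doubles that cost; with --driver-memory 4G that may be what pushes k = 300 over the edge, but I'm not sure.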