When I use cache() to store data, Spark runs very slowly. However, when I don't
call the cache method, it runs fast. My main configuration is as follows:
SPARK_JAVA_OPTS+="-Dspark.local.dir=/home/wangchao/hadoop-yarn-spark/tmp_out_info
-Dspark.rdd.compress=true -Dspark.storage.memoryFraction=0.4
-Dspark.shuffle.spill=false -Dspark.executor.memory=1800m -Dspark.akka.frameSize=100
-Dspark.default.parallelism=6"
And my test code is:
val file = sc.textFile("hdfs://10.168.9.240:9000/user/bailin/filename")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).cache().reduceByKey(_+_)
count.collect()
Any answers or suggestions on how I can resolve this are greatly appreciated.