
In my Hive on Spark job, I get this error:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

Thanks to this answer (Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?), I suspect my Hive on Spark job has the same problem.

Since Hive translates SQL into a Hive on Spark job, I don't know what to set in Hive so that its Hive on Spark job changes from StorageLevel.MEMORY_ONLY to StorageLevel.MEMORY_AND_DISK.

Thanks for your help.


1 Answer

You can use CACHE [LAZY] TABLE <table_name> and UNCACHE TABLE <table_name> to manage caching. See the Spark SQL documentation for more details.
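
A minimal sketch of what that looks like, assuming a recent Spark with a SparkSession named spark and a hypothetical table my_table (on Spark 1.x you would issue the same SQL through a HiveContext instead):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cache-table-example")
      .enableHiveSupport()
      .getOrCreate()

    // LAZY defers materialization until the table is first scanned
    spark.sql("CACHE LAZY TABLE my_table")
    spark.sql("SELECT count(*) FROM my_table").show() // first scan populates the cache
    spark.sql("UNCACHE TABLE my_table")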

If you are using DataFrames, you can use persist(...) to specify the StorageLevel. See the API docs.
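
For example (a sketch reusing the assumed spark session and hypothetical my_table from above; MEMORY_AND_DISK lets partitions that don't fit in memory spill to disk instead of being dropped and recomputed):

    import org.apache.spark.storage.StorageLevel

    val df = spark.table("my_table")
    df.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk rather than drop
    df.count()    // an action forces the data to actually be cached
    df.unpersist()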

In addition to setting the storage level, you can optimize other things as well. Spark SQL uses a different caching mechanism called columnar storage, which is a more efficient way of caching data (as Spark SQL is schema-aware). There is a set of config properties that can be tuned to manage caching, described in detail here (this is the latest version's documentation; refer to the documentation of the version you are using). Two of them are listed below, followed by a short sketch of setting them:

  • spark.sql.inMemoryColumnarStorage.compressed
  • spark.sql.inMemoryColumnarStorage.batchSize
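
A sketch of tuning these, again on the assumed spark session (the values shown are the documented defaults, so treat them only as a starting point):

    // Compress the in-memory columnar cache (trades CPU for memory)
    spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
    // Rows per column batch; larger batches improve compression but risk OOM
    spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
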
  • Thanks for your answer, but is Spark SQL the same as Hive on Spark? And I want to know: for the Spark job in Hive on Spark, when data doesn't fit in memory, can that data be written to disk? If it can, is that the default, or must I set something? – liu young Jan 18 '16 at 11:43
  • I believe they are the same in the sense that they both use the same runtime – Aravind Yarram Jan 18 '16 at 14:06