
I am seeing an issue with Spark caching. I am reading about 50 GB of Snappy-compressed Parquet data with Spark through spark-shell, then caching it with the MEMORY_ONLY_SER storage level. The data gets 100% cached, but surprisingly it occupies 500 GB in the cache.
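
For reference, this is roughly what I am doing in spark-shell (the path below is just a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

// Read ~50 GB of Snappy-compressed Parquet (the path is a placeholder).
val df = spark.read.parquet("/data/my_dataset")

// Cache with the serialized storage level and force full materialization.
df.persist(StorageLevel.MEMORY_ONLY_SER)
df.count()
```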

  1. Is there a way to ensure that the cache holds only around 50 GB of data? I tried setting spark.io.compression.codec = "org.apache.spark.io.SnappyCompressionCodec" and spark.rdd.compress = true, but this did not give me what I was looking for. By default spark.sql.inMemoryColumnarStorage.compressed is true and spark.sql.inMemoryColumnarStorage.batchSize is set to 10000 (how I set these options is sketched after this list).

  2. I then tried caching the same data with MEMORY_ONLY. The data is again 100% cached, and it occupies 500 GB in the cache, i.e. the same as with MEMORY_ONLY_SER. I expected MEMORY_ONLY to take more space, so it seems that storing the data in serialized form is not helping. Any clue?

  3. I also noticed that if I run a simple query such as a distinct count on a single column against the Parquet data on disk, the operation reads only about 5 GB out of the total 50 GB (i.e. it reads only that column). If I run the same query after the data is 100% cached (= 500 GB), the operation reads/processes the entire 500 GB of cached data, i.e. it does not restrict itself to the given column, which again seems strange. Any idea? (The query is sketched below.)
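
Here is roughly how I set the options from point 1 and the query from point 3 (the path and column name are placeholders; as far as I understand, the core compression properties need to be set before the context starts, so I pass them on the spark-shell command line):

```scala
// Core compression properties, passed at launch:
//   spark-shell --conf spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec \
//               --conf spark.rdd.compress=true

import org.apache.spark.sql.functions.countDistinct

// SQL in-memory columnar storage settings (these are the defaults mentioned in point 1).
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

// The "distinct count on a single column" query from point 3 (column name is a placeholder).
val parquetDf = spark.read.parquet("/data/my_dataset")
parquetDf.agg(countDistinct("some_column")).show()
```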

sunillp
  • The 3rd point is expected (see related https://stackoverflow.com/a/50380109 and https://stackoverflow.com/q/49798098/9613318) – Alper t. Turker Aug 07 '18 at 17:10
  • Actually my Parquet data is deeply nested, and I read that Spark is not able to work with nested Parquet data. Will exploding this data and then caching it help in any way? – sunillp Aug 14 '18 at 06:03
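
A minimal sketch of the "explode then cache" idea mentioned in the last comment (the path, the nested items column, and its fields are hypothetical, since the real schema is not shown):

```scala
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.storage.StorageLevel

// Flatten one level of nesting before caching (schema names are hypothetical).
val flattened = spark.read.parquet("/data/my_dataset")
  .withColumn("item", explode(col("items")))              // one row per array element
  .select(col("id"), col("item.price"), col("item.qty"))  // keep only the needed leaf columns

flattened.persist(StorageLevel.MEMORY_ONLY_SER)
flattened.count()
```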

0 Answers