I am seeing an issue with Spark caching. I am reading a Parquet dataset (around 50 GB, compressed with Snappy) through spark-shell and then caching it with the MEMORY_ONLY_SER storage level. The data gets 100% cached, but surprisingly it occupies about 500 GB in the cache.
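For reference, the steps look roughly like this (sketched against the Spark 2.x spark-shell API; the path is a placeholder for my actual dataset location):

```scala
import org.apache.spark.storage.StorageLevel

// Read the Snappy-compressed Parquet data (path is a placeholder)
val df = spark.read.parquet("/data/my_dataset")

// Cache in serialized form and force materialization so the
// Storage tab of the UI reports it as 100% cached
df.persist(StorageLevel.MEMORY_ONLY_SER)
df.count()
```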
Is there a way to ensure that the cache holds only around 50 GB of data? I tried setting spark.io.compression.codec = "org.apache.spark.io.SnappyCompressionCodec" and spark.rdd.compress = true, but this did not give me what I was looking for. By default, spark.sql.inMemoryColumnarStorage.compressed is true and spark.sql.inMemoryColumnarStorage.batchSize is set to 10000.
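This is roughly how I pass and verify those settings (I start spark-shell with --conf flags; the calls below just confirm what the SparkContext picked up):

```scala
// Passed on the command line when starting spark-shell:
//   --conf spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
//   --conf spark.rdd.compress=true

// Confirming them from inside the shell:
sc.getConf.get("spark.io.compression.codec")   // org.apache.spark.io.SnappyCompressionCodec
sc.getConf.get("spark.rdd.compress")           // true
```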
Further, I tried caching the same data with the MEMORY_ONLY storage level. The data is again 100% cached, but it occupies 500 GB in the cache, i.e. the same as with MEMORY_ONLY_SER. I expected this to be larger, so it seems storing the data in serialized form is not helping. Any clue?
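The only change from the snippet above is the storage level, roughly:

```scala
// Same dataset, deserialized storage level this time
df.unpersist()
df.persist(StorageLevel.MEMORY_ONLY)
df.count()   // Storage tab again reports ~500 GB in memory
```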
I also noticed that if I run a simple query like a distinct count on one column against the Parquet data on disk, the operation reads only about 5 GB out of the total 50 GB (i.e. it reads only the specific column). However, if I run the same query after 100% of the data is cached (= 500 GB), the operation reads/processes the entire 500 GB of cached data, i.e. it does not restrict itself to the given column, which again seems strange. Any idea?
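The comparison that shows this is roughly the following (the column name is a placeholder; the input sizes are what the Spark UI reports for the stage):

```scala
// Against the Parquet files on disk: only the requested column is scanned (~5 GB input)
spark.read.parquet("/data/my_dataset")
  .select("some_column").distinct().count()

// Against the cached DataFrame: the stage processes the full ~500 GB of cached data
df.select("some_column").distinct().count()
```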