I am seeing an issue with Spark caching. I am reading a Parquet dataset (around 50 GB, compressed with Snappy) through spark-shell and then caching it with the MEMORY_ONLY_SER storage level. The data gets 100% cached, but surprisingly it occupies about 500 GB in the cache.
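For reference, the steps look roughly like this (sketched against the Spark 2.x spark-shell API; the path is a placeholder for my actual dataset location):

```scala
import org.apache.spark.storage.StorageLevel

// Read the Snappy-compressed Parquet data (path is a placeholder)
val df = spark.read.parquet("/data/my_dataset")

// Cache in serialized form and force materialization so the
// Storage tab of the UI reports it as 100% cached
df.persist(StorageLevel.MEMORY_ONLY_SER)
df.count()
```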
Is there a way to ensure that the cache holds only around 50 GB of data? I tried setting spark.io.compression.codec = "org.apache.spark.io.SnappyCompressionCodec" and spark.rdd.compress = true, but this did not give me what I was looking for. By default, spark.sql.inMemoryColumnarStorage.compressed is true and spark.sql.inMemoryColumnarStorage.batchSize is set to 10000.
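This is roughly how I pass and verify those settings (I start spark-shell with --conf flags; the calls below just confirm what the SparkContext picked up):

```scala
// Passed on the command line when starting spark-shell:
//   --conf spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
//   --conf spark.rdd.compress=true

// Confirming them from inside the shell:
sc.getConf.get("spark.io.compression.codec")   // org.apache.spark.io.SnappyCompressionCodec
sc.getConf.get("spark.rdd.compress")           // true
```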
Further, I tried caching the same data with the MEMORY_ONLY storage level. The data is again 100% cached, but it occupies 500 GB in the cache, i.e. the same as with MEMORY_ONLY_SER. I expected this to be larger, so it seems storing the data in serialized form is not helping. Any clue?
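The only change from the snippet above is the storage level, roughly:

```scala
// Same dataset, deserialized storage level this time
df.unpersist()
df.persist(StorageLevel.MEMORY_ONLY)
df.count()   // Storage tab again reports ~500 GB in memory
```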
I also noticed that if I run a simple query like a distinct count on one column against the Parquet data on disk, the operation reads only about 5 GB out of the total 50 GB (i.e. it reads only the specific column). However, if I run the same query after 100% of the data is cached (= 500 GB), the operation reads/processes the entire 500 GB of cached data, i.e. it does not restrict itself to the given column, which again seems strange. Any idea?
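The comparison that shows this is roughly the following (the column name is a placeholder; the input sizes are what the Spark UI reports for the stage):

```scala
// Against the Parquet files on disk: only the requested column is scanned (~5 GB input)
spark.read.parquet("/data/my_dataset")
  .select("some_column").distinct().count()

// Against the cached DataFrame: the stage processes the full ~500 GB of cached data
df.select("some_column").distinct().count()
```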