We have a 130 GB table with 4000 columns, stored as Parquet. When we select 2 of those columns, the Spark UI reports a total of roughly 30 GB read. However, if we select those two columns and write them out as a separate dataset, the resulting dataset is only 17 MB. Given that Parquet is a columnar format, column pruning does not appear to be working properly. I've found this question, but I'm unsure how to diagnose the problem further and what steps to take to reduce the amount of I/O required.
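For context, here is roughly how we select the columns and how we've been inspecting the plan. This is a simplified PySpark sketch, not our exact code; the path and column names are placeholders. My understanding is that if column pruning kicks in, the FileScan node's ReadSchema should list only the two selected columns:

```python
# Simplified sketch (PySpark); path and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the wide Parquet table.
df = spark.read.parquet("/mnt/data/wide_table")  # placeholder path

# Select only the two columns we need.
narrow = df.select("col_a", "col_b")  # placeholder column names

# Inspect the physical plan: the FileScan's ReadSchema should contain
# only col_a and col_b if column pruning is being applied.
narrow.explain(True)
```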
It was my understanding that the benefit of columnar storage is that each column can be read more or less independently of the others.
We're running Hadoop 2.7.x on Databricks. The issue occurs on both the 6.x and 7.x Databricks runtimes (Spark 2.4 and 3.0).
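One mitigation we're considering, but haven't verified yet, is passing an explicit two-column schema on read so Spark never has to deal with the full 4000-column schema. A sketch of the idea (again, path, column names, and types are placeholders, not our real schema):

```python
# Sketch of a possible mitigation (PySpark); names and types are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Provide only the columns we need, so the Parquet reader is asked for
# just these two column chunks up front.
narrow_schema = StructType([
    StructField("col_a", StringType(), True),
    StructField("col_b", DoubleType(), True),
])

narrow = spark.read.schema(narrow_schema).parquet("/mnt/data/wide_table")

# Trigger a job and then check the input size reported in the Spark UI.
print(narrow.count())
```

Would this be expected to reduce the bytes read, or is there a better way to diagnose why the scan is reading ~30 GB for two columns?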