
We have a 130GB table with 4,000 columns. When we select 2 of these columns, the Spark UI reports a total of 30GB read. However, if we select those two columns and store them as a separate dataset, the total size of that dataset is just 17MB. Given that Parquet is columnar storage, something appears not to be working properly. I've found this question, but I'm unsure how to diagnose this further and what steps to take to reduce the amount of I/O required.
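
Roughly what we run, as a minimal PySpark sketch (the table path, output path, and column names below are placeholders, not our real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the ~130GB, 4000-column Parquet table (placeholder path).
df = spark.read.parquet("dbfs:/path/to/wide_table")

# Select only two of the columns (placeholder column names).
subset = df.select("col_a", "col_b")

# Writing the two columns out yields a ~17MB dataset,
# yet the Spark UI reports ~30GB read for this job.
subset.write.mode("overwrite").parquet("dbfs:/path/to/two_columns")
```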

It was my understanding that the benefit of columnar storage is that each column can be read more or less independently of the others.
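
One way to check whether column pruning is actually applied is to inspect the physical plan: the Parquet scan node's `ReadSchema` should list only the selected columns. A sketch, reusing the placeholder names above:

```python
# If column pruning works, the FileScan parquet node in the plan should show a
# ReadSchema containing only the two selected columns.
spark.read.parquet("dbfs:/path/to/wide_table").select("col_a", "col_b").explain(True)
```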

We're running Hadoop 2.7.x on Databricks. It occurs on both the 6.x and 7.x Databricks runtime versions (Spark 2.4 / 3.0).

pascalwhoop
  • When Spark reads Parquet, it indeed reads just the columns you requested. Perhaps the 30GB refers to uncompressed data and the 17MB is compressed? By default, Spark will write compressed Parquet. – Alon Catz Nov 13 '20 at 14:25
  • No, it's the same (compressed) data in both cases – pascalwhoop Nov 13 '20 at 14:48
  • Perhaps you can enable debug logging (as suggested in the answers to the question you quoted) and see? I would start by setting the DEBUG level for `org.apache.parquet.hadoop.ParquetFileReader` or `org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat`, depending on how you read your table (see the sketch after these comments). – mazaneicha Nov 13 '20 at 14:56
  • @pascalwhoop Could you show the code that reproduces this? – mck Nov 13 '20 at 15:50
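
A sketch of enabling the debug logging suggested above from a Databricks notebook, assuming PySpark and the log4j 1.x API bundled with Spark 2.4/3.0 (the logger names are assumptions and may differ between versions):

```python
# Raise the log level for the Parquet reader and Spark's Parquet data source
# through the log4j 1.x API shipped with Spark 2.4/3.0.
log4j = spark.sparkContext._jvm.org.apache.log4j
for name in (
    "org.apache.parquet.hadoop.ParquetFileReader",                           # assumed logger name
    "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat",  # assumed logger name
):
    log4j.LogManager.getLogger(name).setLevel(log4j.Level.DEBUG)
```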

0 Answers