PySpark mergeSchema on Read operation Parquet vs Avro

Asked Apr 07 '22 at 07:28

Active Apr 07 '22 at 07:28

Viewed 185 times

I have around 200 Parquet files, each with a different schema. Reading them with mergeSchema enabled takes almost 2 hours. If I instead create equivalent Avro files and read them with the mergeSchema option (available only on Databricks Runtime 9.3 LTS), the merge completes within 5 minutes.
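For reference, the two reads described above might look roughly like the sketch below. The helper name `read_with_merged_schema` and the paths in the usage comments are hypothetical and not taken from my actual job. The Avro case assumes the Avro data source is available on the cluster and that the runtime accepts `mergeSchema` for Avro, as noted above.

```python
def read_with_merged_schema(spark, fmt, path):
    """Read all files under `path`, asking the reader to merge their schemas.

    `spark` is an existing SparkSession; `fmt` is "parquet" or "avro".
    """
    return (
        spark.read.format(fmt)
        .option("mergeSchema", "true")  # ask the reader to reconcile per-file schemas
        .load(path)
    )

# Hypothetical usage (placeholder paths):
# parquet_df = read_with_merged_schema(spark, "parquet", "/data/parquet_files/")
# avro_df = read_with_merged_schema(spark, "avro", "/data/avro_files/")
```

The only difference between the two runs is the format string; the option and the load call are the same.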

Question: Why does the Parquet schema merge on read take so long, whereas the same merge on the Avro files is so much faster?

asked Apr 07 '22 at 07:28

smati

Does [this](https://stackoverflow.com/questions/28957291/avro-vs-parquet) answer your question? – Dipanjan Mallick Apr 07 '22 at 07:55
Yes, it does to some extent. I understand that Parquet is not meant for highly denormalized data, or for processing or reading entire records of that kind rather than just a few. Thank you. – smati Apr 14 '22 at 04:26

0 Answers