
I have multiple parquet files named file00.parquet, file01.parquet, file02.parquet, and so on. All the files follow the same schema as file00.parquet. How do I stack the files one below the other, starting from file00 onwards in that same order, using PySpark?

twSoulz
  • Do you want to read all the parquet files at the same time? – Jonathan Lam Aug 11 '22 at 03:05
  • @Jonathan Reading them at the same time or one by one is not the main issue for me now. The files are all in the same directory. I want to read all those parquet files and save them to one single parquet file/dataframe using PySpark. I had done the same using pandas, but I don't want to use pandas as it takes too much time for large files. – Ashish Padhi Aug 11 '22 at 03:08
  • Does this answer your question? [How to append multiple parquet files to one dataframe in Pandas](https://stackoverflow.com/questions/59164709/how-to-append-multiple-parquet-files-to-one-dataframe-in-pandas), [Reading DataFrame from partitioned parquet file](https://stackoverflow.com/questions/33650421/reading-dataframe-from-partitioned-parquet-file), [How can I read multiple parquet files in Spark Scala](https://stackoverflow.com/questions/58240979/how-can-i-read-multiple-parquet-files-in-spark-scala) – Azhar Khan Aug 11 '22 at 03:34
  • This also answers another question I had, thank you! – Ashish Padhi Aug 11 '22 at 04:04

1 Answer


Since you mentioned that all the parquet files are in the same directory and share the same schema, you can read them all at once by pointing Spark at the directory:

# Files in the directory, all sharing the schema of file00.parquet:
#   /root/to/data/file00.parquet
#   /root/to/data/file01.parquet
#   ...

df = spark.read.parquet("/root/to/data/")

If you want to save them as a single parquet file, you can:

# repartition(1) collapses the data into one partition so Spark writes a single file
df.repartition(1).write.save(save_path, format='parquet')
Jonathan Lam
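One caveat on the answer above: reading a whole directory does not guarantee that rows come back in file-name order, and `repartition(1)` performs a shuffle that can reorder them again. If the file00, file01, ... order must survive in the output, below is a minimal sketch of one way to enforce it. It assumes `spark` is an existing `SparkSession`, the files live under `/root/to/data/` as in the question, and `/root/to/output` is a hypothetical destination path.

from pyspark.sql import functions as F

# Tag every row with the path of the file it came from.
df = (spark.read.parquet("/root/to/data/")
          .withColumn("source_file", F.input_file_name()))

# file00 < file01 < file02 ... lexicographically, so sorting by the
# source path restores the original file order.
ordered = df.orderBy("source_file").drop("source_file")

# coalesce(1) merges partitions without a shuffle, keeping the sorted
# order and producing a single output file.
ordered.coalesce(1).write.parquet("/root/to/output")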
  • So read.parquet will read all the files from that folder in that order automatically? That is interesting. I will try this out! Thank you. My main goal is to convert the final parquet file to a .hyper type file, so it will be very helpful to concatenate them all into one single parquet file before conversion. – Ashish Padhi Aug 11 '22 at 03:19
  • @AshishPadhi Yes, you can, because your current directory structure is exactly how partitioning works. If you want to make sure that your records are ordered, you should use `orderBy()` before you write the new parquet. – Jonathan Lam Aug 11 '22 at 03:44
  • Hi, I am getting the following error: `Py4JJavaError: An error occurred while calling o25.parquet.` There are more messages in the error, but I am not able to copy-paste the whole thing here. – Ashish Padhi Aug 11 '22 at 05:24
  • Hi @AshishPadhi, could you edit your own post or open a new post? It's impossible to debug your error based on this error log – Jonathan Lam Aug 11 '22 at 05:58
  • https://stackoverflow.com/questions/73316032/error-in-reading-multiple-parquet-files-with-same-schema-with-pyspark I have made a new post. – Ashish Padhi Aug 11 '22 at 06:14
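Regarding the `Py4JJavaError` in the last comments: with multiple files, that error often means one file is corrupt or its schema does not match the rest. Below is a minimal sketch for narrowing down which file fails, assuming the files sit on the local filesystem (listing paths on HDFS or S3 works differently) and `spark` is an existing `SparkSession`.

import os

data_dir = "/root/to/data"  # directory from the question

# Read each parquet file on its own so a failure points at one file.
for name in sorted(os.listdir(data_dir)):
    if not name.endswith(".parquet"):
        continue
    path = os.path.join(data_dir, name)
    try:
        spark.read.parquet(path).take(1)  # force an actual read
        print("OK  ", path)
    except Exception as exc:
        print("FAIL", path, exc)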