
I am working on decompressing snappy.parquet files with Spark and pandas. I have 180 files (about 7 GB of data) in my Jupyter notebook. As I understand it, I need to create a loop that grabs all the files, decompresses them with Spark, and appends them to a pandas dataframe? Here is the code:

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a single snappy-compressed parquet file (Spark decompresses it automatically)
parquetFile = spark.read.parquet("file_name.snappy.parquet")

# Register a temp view so the data can be queried with Spark SQL
parquetFile.createOrReplaceTempView("parquetFile")
file_output = spark.sql("SELECT * FROM parquetFile")
file_output.show()

# Convert the Spark dataframe to pandas
pandas_df = file_output.select("*").toPandas()

This part works: I get my pandas dataframe from one file, but I still have another 180 files that I need to append to pandas_df. Can anyone help me out? Thank you!
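Something like this sketch is what I have in mind (the glob pattern is a placeholder):

import glob
import pandas as pd

# Sketch of the loop I have in mind: read each file with Spark,
# convert it to pandas, and concatenate everything at the end.
frames = []
for path in glob.glob("my_data_dir/*.snappy.parquet"):
    frames.append(spark.read.parquet(path).toPandas())
pandas_df = pd.concat(frames, ignore_index=True)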

  • Does this answer your question? [Reading parquet files from multiple directories in Pyspark](https://stackoverflow.com/questions/37257111/reading-parquet-files-from-multiple-directories-in-pyspark) – Shaido Dec 04 '19 at 05:49

1 Answer


With Spark you can load a dataframe from a single file or from multiple files; you only need to replace the path to your single file with the path to your folder (assuming all 180 files are in the same directory).

parquetFile = spark.read.parquet("your_dir_path/")
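For example, a minimal sketch (the directory path is a placeholder; note that `toPandas()` collects everything to the driver, so ~7 GB may not fit in memory):

# Spark reads every parquet part file in the directory into one dataframe;
# "your_dir_path/" is a placeholder for the folder holding all 180 files.
parquetFile_all = spark.read.parquet("your_dir_path/")

# Optional: convert to pandas afterwards. toPandas() collects all rows to
# the driver, so this may run out of memory with data of this size.
pandas_df = parquetFile_all.toPandas()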
  • Thank you! That worked. I now want to convert it to a pandas df so it is easier for me to run my queries. Any idea how I can make that happen? I tried the following: `parquetFile_all.createOrReplaceTempView("parquetFile_all")`, `file_output_all = spark.sql("SELECT * FROM parquetFile_all")`, `file_output_all.show()`, and that gave me an error. – Chique_Code Dec 04 '19 at 16:59
  • Could you please share the error? You can take a look into this question about how to use `createOrReplaceTempView`: https://stackoverflow.com/questions/44011846/how-does-createorreplacetempview-work-in-spark. Also, depending on the complexity of the queries, you can do it directly on the `DF` using `.select` / `.filter` functions. Examples: https://stackoverflow.com/questions/42409756/where-clause-not-working-in-spark-sql-dataframe. – Cesar A. Mostacero Dec 04 '19 at 17:58
  • My queries will be relatively complex, I will use regex for them, and I am not sure how to do that with the Spark df. Also, I will need to share the data with a colleague, so ideally I want to convert it to a csv file afterwards. Here is the link to the error; I have created another question on SO: https://stackoverflow.com/questions/59181687/convert-spark-read-parquet-into-pandas-dataframe – Chique_Code Dec 04 '19 at 18:05
  • I saw the new question that you posted; here are my recommendations: 1 - The output data seems to be really large, which is inconvenient in two ways: processing it (given that you are using pandas) and saving it (if you want to share it with other people). 2 - To process it, don't use pandas; use `spark-sql` (`createOrReplaceTempView`) and run your queries over the temp table. 3 - To share the output, save it as `csv` (`DF.write.option("header", "true").csv(OUTPUT_PATH)`), but don't collect it into one single file, to avoid the same error happening. See the sketch after these comments. – Cesar A. Mostacero Dec 04 '19 at 20:27
  • Fantastic! Great help. Thank you very much! – Chique_Code Dec 04 '19 at 20:29
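Putting the comment thread together, a minimal sketch of that workflow might look like this (the view name `parquetFile_all`, the column `some_column`, the regex, and the output path are all placeholders):

# Read all 180 files from the directory, as in the answer above.
parquetFile_all = spark.read.parquet("your_dir_path/")

# Query the data with Spark SQL instead of pandas; RLIKE gives regex matching.
parquetFile_all.createOrReplaceTempView("parquetFile_all")
matches = spark.sql(
    "SELECT * FROM parquetFile_all WHERE some_column RLIKE '^foo.*'"
)

# Save the result for sharing. Spark writes a directory of csv part files,
# which avoids collecting everything into one file on the driver.
matches.write.option("header", "true").csv("output_dir_csv")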