
I have several parquets in a similar folder structure:

'/raw-files/17001/result.parquet'
'/raw-files/17002/result.parquet'
'/raw-files/...../result.parquet'
'/raw-files/18000/result.parquet'

I want to combine all of the parquets into one DataFrame while adding a column that uses the unique folder name (17001, 17002, ....., 18000) as a key to distinguish between them. So far I have

import os
raw_files = os.listdir('raw-files')

to create a list of all the unique folder names and then create a dictionary of DataFrames by looping through those directories and reading the parquets.

from pyspark.sql.functions import lit

df_dict = {}
for folder in raw_files:
    path = 'raw-files/' + folder + '/'
    df_dict[folder] = spark.read.parquet(path + 'result.parquet').withColumn('Key', lit(folder))

So now I have a dictionary of Spark DataFrames with the desired Key column, but I'm not sure how to reduce them to a single DataFrame. I know there are ways to do it with pandas, but I'd like to stay within the Spark framework. There could also be an easier way to do this in Spark that I'm just overlooking.

  • @mck, I saw that before, but it just gave me an idea that worked. Instead of using a dictionary to store the dataframes I can use a list and the solution linked above works. – user3304359 Feb 04 '21 at 15:42

1 Answer


Instead of storing the DataFrames in a dictionary, I used a list.

df_list = []
for folder in raw_files:
    path = 'raw-files/' + folder + '/'
    df_list.append(spark.read.parquet(path + 'result.parquet').withColumn('Key', lit(folder)))

From there, I can use the solution mck linked:

from functools import reduce
from pyspark.sql import DataFrame

df = reduce(DataFrame.unionAll, df_list)
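If the individual parquet files might not all share the same column order, the same reduction can be done with DataFrame.unionByName (available since Spark 2.3), which matches columns by name instead of by position and may be the safer choice:

# Same reduction, but columns are matched by name rather than position.
df = reduce(DataFrame.unionByName, df_list)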

If anyone has a more efficient way of doing this, let me know!
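One possibly simpler route, sketched here under the assumption that the folders sit directly under raw-files exactly as shown in the question: let Spark expand a wildcard path itself and derive the Key from each row's source file with input_file_name(), so no Python-side loop is needed at all.

from pyspark.sql.functions import input_file_name, regexp_extract

# Read every result.parquet in one pass; Spark expands the wildcard itself.
df = spark.read.parquet('raw-files/*/result.parquet')

# Pull the folder name out of each row's source path,
# e.g. '.../raw-files/17001/result.parquet' -> '17001'.
df = df.withColumn('Key', regexp_extract(input_file_name(), r'raw-files/([^/]+)/', 1))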
