I have several parquet files in a folder structure like this:
'/raw-files/17001/result.parquet'
'/raw-files/17002/result.parquet'
'/raw-files/...../result.parquet'
'/raw-files/18000/result.parquet'
I want to combine all of the parquets into one DataFrame, adding a column that uses the unique folder name (17001, 17002, ....., 18000) as a key to distinguish between them. So far I have
import os

raw_files = os.listdir('raw-files')
to create a list of all the unique folder names and then create a dictionary of DataFrames by looping through those directories and reading the parquets.
from pyspark.sql.functions import lit

df_dict = {}
for folder in raw_files:
    path = 'raw-files/' + folder + '/'
    # read each folder's parquet and tag its rows with the folder name
    df_dict[folder] = spark.read.parquet(path + 'result.parquet').withColumn('Key', lit(folder))
So now I have a dictionary of Spark DataFrames with the desired Key column, but I'm not sure how to reduce them to a single DataFrame. I know there are ways to do it with Pandas, but I'd like to stick with the Spark framework. There could also be an easier way to do this in Spark and I'm just overlooking it.
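I'm guessing the answer involves something like functools.reduce with a union over the dictionary values. Here's a rough sketch of what I have in mind, assuming the df_dict built above and that every DataFrame ends up with the same schema, but I don't know if this is the right approach or how well it performs:

from functools import reduce

# union every per-folder DataFrame into one, matching columns by name
combined_df = reduce(lambda a, b: a.unionByName(b), df_dict.values())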