I have a PySpark DataFrame. I want to perform some function on each partition with foreachPartition and then save each result to Hive. The result (within each partition) is a pandas DataFrame. What is the best way to do this?
I have tried the following without success (gives a serialization error):
def processData(partition):
    # ... do something with the partition's rows to build a pandas DataFrame ...
    pandas_df = pd.DataFrame([row.asDict() for row in partition])
    # 'spark' here is the driver-side SparkSession; it cannot be pickled into
    # a function shipped to the executors, hence the serialization error
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.write.mode("append").format("parquet").saveAsTable("db.table_name")

original_spark_df.rdd.foreachPartition(processData)
I guess one solution would be to turn the pandas DataFrame back into records and return them (using mapPartitions instead of foreachPartition), and then use rdd.toDF() and saveAsTable().
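For what it's worth, here is a rough sketch of that mapPartitions idea (untested; it assumes the processing keeps the original columns, so original_spark_df.columns can be reused for toDF(), and processPartition is just a placeholder):

import pandas as pd

def processPartition(partition):
    # collect this partition's rows into a pandas DataFrame
    pandas_df = pd.DataFrame([row.asDict() for row in partition])
    # ... do something with pandas_df here ...
    # yield plain tuples so Spark can rebuild an RDD from the result
    for row in pandas_df.itertuples(index=False):
        yield tuple(row)

result_rdd = original_spark_df.rdd.mapPartitions(processPartition)
# back on the driver: convert to a Spark DataFrame and save once
result_df = result_rdd.toDF(original_spark_df.columns)
result_df.write.mode("append").format("parquet").saveAsTable("db.table_name")

This at least avoids touching the SparkSession inside the partition function, since the write happens on the driver.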
But is there some solution to save the pandas DataFrame to Hive from within foreachPartition?