
I have a PySpark DataFrame. I want to apply a function to each partition with foreachPartition and then save each result to Hive. The result (within each partition) is a pandas DataFrame. What is the best way to do this?

I have tried the following without success (gives a serialization error):

def processData(x):
    # do something that produces pandas_df for this partition
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.write.mode("append").format("parquet").saveAsTable("db.table_name")

original_spark_df.rdd.foreachPartition(processData)

I guess one solution would be to turn the pandas DataFrame back into rows and return them (using mapPartitions instead of foreachPartition), and then use rdd.toDF() and saveAsTable().
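Something like this is what I have in mind (a minimal sketch only; the per-partition computation and the column names are placeholders):

    import pandas as pd
    from pyspark.sql import Row

    def process_partition(rows):
        # Collect this partition's rows into a pandas DataFrame
        pandas_df = pd.DataFrame([r.asDict() for r in rows])
        # ... run the per-partition computation here; result_df is a placeholder ...
        result_df = pandas_df
        # Yield Rows so Spark can rebuild a DataFrame on top of the resulting RDD
        for rec in result_df.to_dict(orient="records"):
            yield Row(**rec)

    result = original_spark_df.rdd.mapPartitions(process_partition).toDF()
    result.write.mode("append").format("parquet").saveAsTable("db.table_name")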

Is there some way to save the pandas DataFrame to Hive from within foreachPartition?

Stergios
  • [How to save a huge pandas dataframe to hdfs?](https://stackoverflow.com/q/47393001/9613318) – Alper t. Turker May 07 '18 at 09:45
  • If I just save pandas as parquet files, how will Hive update its metadata? – Stergios May 07 '18 at 11:16
  • `MSCK REPAIR TABLE` (Hive) / `REFRESH TABLE` (Spark). What you describe in the question is just not allowed (to be honest I am not sure why would you choose this design). – Alper t. Turker May 07 '18 at 11:22
  • I want to run a Machine Learning task with each partition's data and save some output to Hive. I'm fine with transforming pandas DF to an intermediate format before saving to Hive. – Stergios May 07 '18 at 11:38
  • So why not convert the result of `mapPartitions` back to a Spark `DataFrame` for saving? Or just use `pandas_udf`? (see the sketch after these comments) – Alper t. Turker May 07 '18 at 11:49
  • I actually want to save two dataframes within each partition. That's why I don't want to return back to Spark DF. – Stergios May 07 '18 at 11:57
  • 1
    In that case, the linked solution (or a similar one, like using one of Python Hive clients ) is the only choice you have. – Alper t. Turker May 07 '18 at 12:05
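For reference, a minimal sketch of the `pandas_udf` (grouped map) route mentioned in the comments — the grouping column, output schema, and model step are hypothetical placeholders, and it assumes Spark 2.3+:

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Output schema of the per-group result (hypothetical column names/types)
    @pandas_udf("group_id long, score double", PandasUDFType.GROUPED_MAP)
    def run_model(pdf):
        # pdf is a pandas DataFrame holding all rows of one group
        # ... fit / score the model here; the returned pandas DataFrame must match the schema above ...
        return pdf[["group_id", "score"]]

    result = original_spark_df.groupby("group_id").apply(run_model)
    result.write.mode("append").format("parquet").saveAsTable("db.table_name")

If two separate DataFrames really have to be written from inside each partition, the linked approach (writing parquet files directly from the workers and then running `MSCK REPAIR TABLE` / `REFRESH TABLE` on the driver) is the alternative the comments point to.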

0 Answers