
I have a PySpark DataFrame. I want to apply a function to each partition with foreachPartition and then save each result to Hive. The result (within each partition) is a pandas DataFrame. What is the best way to do this?

I have tried the following without success (gives a serialization error):

def processData(x):
    # do something that produces pandas_df for this partition
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.write.mode("append").format("parquet").saveAsTable("db.table_name")

original_spark_df.rdd.foreachPartition(processData)

I guess one solution would be to turn the pandas DataFrame back into rows and return them (using mapPartitions instead of foreachPartition), and then use rdd.toDF() and saveAsTable().
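Something like this is what I have in mind (a minimal sketch only; the per-partition computation and the column names are placeholders):

    import pandas as pd
    from pyspark.sql import Row

    def process_partition(rows):
        # Collect this partition's rows into a pandas DataFrame
        pandas_df = pd.DataFrame([r.asDict() for r in rows])
        # ... run the per-partition computation here; result_df is a placeholder ...
        result_df = pandas_df
        # Yield Rows so Spark can rebuild a DataFrame on top of the resulting RDD
        for rec in result_df.to_dict(orient="records"):
            yield Row(**rec)

    result = original_spark_df.rdd.mapPartitions(process_partition).toDF()
    result.write.mode("append").format("parquet").saveAsTable("db.table_name")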

Is there some way to save the pandas DataFrame to Hive from within foreachPartition?

Stergios
  • [How to save a huge pandas dataframe to hdfs?](https://stackoverflow.com/q/47393001/9613318) – Alper t. Turker May 07 '18 at 09:45
  • If I just save pandas as parquet files, how will Hive update its metadata? – Stergios May 07 '18 at 11:16
  • `MSCK REPAIR TABLE` (Hive) / `REFRESH TABLE` (Spark). What you describe in the question is just not allowed (to be honest I am not sure why would you choose this design). – Alper t. Turker May 07 '18 at 11:22
  • I want to run a Machine Learning task with each partition's data and save some output to Hive. I'm fine with transforming pandas DF to an intermediate format before saving to Hive. – Stergios May 07 '18 at 11:38
  • So why not convert the result of `mapPartitions` back to a Spark `DataFrame` for saving? Or just use `pandas_udf`? (see the sketch after these comments) – Alper t. Turker May 07 '18 at 11:49
  • I actually want to save two dataframes within each partition. That's why I don't want to return back to Spark DF. – Stergios May 07 '18 at 11:57
  • 1
    In that case, the linked solution (or a similar one, like using one of Python Hive clients ) is the only choice you have. – Alper t. Turker May 07 '18 at 12:05
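For reference, a minimal sketch of the `pandas_udf` (grouped map) route mentioned in the comments — the grouping column, output schema, and model step are hypothetical placeholders, and it assumes Spark 2.3+:

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Output schema of the per-group result (hypothetical column names/types)
    @pandas_udf("group_id long, score double", PandasUDFType.GROUPED_MAP)
    def run_model(pdf):
        # pdf is a pandas DataFrame holding all rows of one group
        # ... fit / score the model here; the returned pandas DataFrame must match the schema above ...
        return pdf[["group_id", "score"]]

    result = original_spark_df.groupby("group_id").apply(run_model)
    result.write.mode("append").format("parquet").saveAsTable("db.table_name")

If two separate DataFrames really have to be written from inside each partition, the linked approach (writing parquet files directly from the workers and then running `MSCK REPAIR TABLE` / `REFRESH TABLE` on the driver) is the alternative the comments point to.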

0 Answers