
In my Spark job, I tried to overwrite a table in each micro-batch of a Structured Streaming query:

batchDF.write.mode(SaveMode.Overwrite).saveAsTable("mytable")
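For context, the write is issued from foreachBatch, roughly like this (a minimal sketch; streamingDF and the checkpoint path are placeholders for my actual source and setup):

import org.apache.spark.sql.{DataFrame, SaveMode}

// Overwrite the managed table on every micro-batch.
val writeBatch = (batchDF: DataFrame, batchId: Long) => {
  batchDF.write.mode(SaveMode.Overwrite).saveAsTable("mytable")
}

val query = streamingDF.writeStream                              // streamingDF: streaming source DataFrame (placeholder)
  .foreachBatch(writeBatch)
  .option("checkpointLocation", "/tmp/checkpoints/mytable")      // placeholder path
  .start()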

This write generated the following error:

  Can not create the managed table('`mytable`'). The associated location('file:/home/ec2-user/environment/spark/spark-local/spark-warehouse/mytable') already exists.;

I know that in Spark 2.x, the way to solve this issue is to set the following option:

spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")

It works well in Spark 2.x. However, this option was removed in Spark 3.0.0. How should we solve this issue in Spark 3.0.0?

Thanks!

yyuankm
  • Please try to explicitly specify the path where you're going to save with the 'overwrite' mode. – John Thomas Sep 19 '20 at 17:33
  • Thanks John, I can confirm it works by adding a path in Spark 3.0. The way I add the path is as follows: ```batchDF.write.mode(SaveMode.Overwrite).option("path", "/home/ec2-user/environment/spark/spark-local/tmp").saveAsTable("mytable")```. I am deploying in standalone mode. Do you also have any comments on what the correct path would be if I want to deploy it to a Hadoop cluster? Thanks! – yyuankm Sep 21 '20 at 23:56
  • Does setting "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" to "true" also delete the remaining files? Otherwise, you might get a mix of old and new files. – Hanan Shteingart Jan 28 '21 at 12:28
  • I can confirm that this works: I am loading JSON-format local Hive tables in integration tests and have specified the same directory that was being used by default (which is in my IDE), and now it doesn't fail if the files already exist. – stephen newman Feb 19 '21 at 01:52
  • @yyuankm Did you find a solution for this issue ? – Abdennacer Lachiheb Jun 24 '22 at 09:51

1 Answer


It looks like you run your test data generation and your actual test in the same process. Can you just replace these writes with createOrReplaceTempView, so the data is saved to Spark's in-memory catalog instead of into a Hive catalog?

Something like: batchDF.createOrReplaceTempView("mytable")
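In the streaming case, a sketch of what that could look like inside foreachBatch (streamingDF is a placeholder for your source):

import org.apache.spark.sql.DataFrame

// Register each micro-batch as a temp view in the session's in-memory catalog
// instead of writing a managed Hive table; the view is replaced on every batch.
val registerBatch = (batchDF: DataFrame, batchId: Long) => {
  batchDF.createOrReplaceTempView("mytable")
}

streamingDF.writeStream   // streamingDF: streaming source DataFrame (placeholder)
  .foreachBatch(registerBatch)
  .start()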