# `df_list` is a list of DataFrames to merge into the target table;
# `spark_combine_first` and `dup_collect` are my own helpers (not shown).
final = sqlCtx.read.table('XXX.YYY')  # read the target table once, before the loop
for elem in df_list:
    interim = final.join(elem, 'user_id', 'fullouter')
    # Coalesce the columns that appear on both sides; pass the rest through.
    final = interim.select(
        ['user_id']
        + [spark_combine_first(final[c], elem[c]).alias(c)
           for c in dup_collect(interim.columns)[0] if c != 'user_id']
        + [c for c in dup_collect(interim.columns)[1] if c != 'user_id'])

final.write.mode("overwrite").saveAsTable("XXX.temp_test")
final2 = sqlCtx.read.table('XXX.temp_test')

final2.write.mode("overwrite").saveAsTable("XXX.YYY")

This is my mock code. As you can see, I read from a table and ultimately write back to the same table on our Hadoop cluster, but I get an error saying the table cannot be overwritten while it is being read from.

I have found a temporary workaround (writing to a temporary table, reading that back into a new DataFrame, and finally writing to the required table), but this seems very inefficient.

I was hoping for another approach in which I could simply rename the temporary table from within the Spark API, but I have not had much success.
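
For what it's worth, here is a minimal sketch of the rename idea, assuming the metastore is Hive and that your Hive version accepts database-qualified names in ALTER TABLE ... RENAME TO (the table names are the mock names from above):

# Sketch only: swap the temp table into place via Hive DDL issued through sqlCtx.
# Assumes a HiveContext/SparkSession with Hive support and that both tables
# live in the same database.
final.write.mode("overwrite").saveAsTable("XXX.temp_test")
sqlCtx.sql("DROP TABLE IF EXISTS XXX.YYY")
sqlCtx.sql("ALTER TABLE XXX.temp_test RENAME TO XXX.YYY")

Note that this still drops the original table first, so there is a brief window in which XXX.YYY does not exist; the checkpointing suggestion in the comments below avoids the temp table entirely.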


  • Is your sqlCtx an object of HiveContext? If not, can you try creating a HiveContext? What type of input file is this? – pvy4917 Oct 17 '18 at 14:40
  • You can try using checkpointing. Refer to the answer by **nsanglar**: https://stackoverflow.com/questions/38746773/read-from-a-hive-table-and-write-back-to-it-using-spark-sql – pvy4917 Oct 17 '18 at 14:44
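
Following up on that checkpointing suggestion, here is a minimal sketch, assuming Spark >= 2.1 (which provides DataFrame.checkpoint), that `sc` is your SparkContext, and an illustrative checkpoint directory path. Checkpointing materializes the data and truncates the lineage, so the final plan no longer depends on scanning XXX.YYY and the overwrite is allowed:

# Sketch only: checkpoint to cut the lineage before overwriting the source table.
sc.setCheckpointDir("/tmp/spark-checkpoints")  # illustrative path on HDFS/local

final = sqlCtx.read.table('XXX.YYY')
# ... build `final` with the same join/select loop as in the question ...
final = final.checkpoint(eager=True)  # materializes the data, removing the
                                      # plan's dependency on the XXX.YYY scan
final.write.mode("overwrite").saveAsTable("XXX.YYY")  # no temp table round trip

This avoids the extra Hive table entirely, at the cost of writing the intermediate data once to the checkpoint directory.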

0 Answers