I don't think it is possible to append data to an existing file.
But you can work around this in one of these ways:
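Approach 1:
Using Spark, write to an intermediate temporary table and then overwrite the final table from it (Spark generally refuses to overwrite a table that it is reading from in the same query, which is why the temp table is needed):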
existing_df = spark.table("existing_hive_table")  # get the current data from hive
# current_df is the new dataframe to append
union_df = existing_df.union(current_df)
union_df.write.mode("overwrite").saveAsTable("temp_table")  # write the data to temp table
temp_df = spark.table("temp_table")  # get data from temp table
temp_df.repartition(<number>).write.mode("overwrite").saveAsTable("existing_hive_table")  # overwrite to final table
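The `<number>` passed to repartition decides how many files the final table ends up with. One rough way to pick it is to divide the total data size by a target file size. This is only a sketch: the helper name `estimate_num_partitions` and the 128 MB target are assumptions for illustration, not something Spark provides, and the on-disk size after compression will differ from the raw size.

```python
import math

# Assumed target size per output file; 128 MB is a common HDFS block size,
# but the right value depends on your cluster.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def estimate_num_partitions(total_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Return how many partitions (output files) to ask repartition() for."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Hypothetical usage: 1 GiB of data split into 128 MiB files.
num_files = estimate_num_partitions(1024 * 1024 * 1024)
# temp_df.repartition(num_files).write.mode("overwrite").saveAsTable("existing_hive_table")
```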
Approach 2:
Hive (not Spark) allows overwriting a table while selecting from the same table, i.e.
insert overwrite table default.t1 partition(partition_column)
select * from default.t1; -- overwrite and select from same t1 table
If you follow this approach, a Hive job needs to be triggered once your Spark job finishes.
Hive acquires a lock while running the overwrite/select on the same table, so any other job that is writing to the table will wait.
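If the Hive step has to run right after the Spark job, a small wrapper script can chain the two. The sketch below only builds the statement and hands it to the Hive CLI with `hive -e`. The function names are hypothetical, and it assumes `hive` is on the PATH and that the table is written with dynamic partitions (hence the two `set` lines).

```python
import subprocess


def build_compaction_sql(db, table, partition_column):
    """Build the insert overwrite statement that rewrites a table from itself."""
    # Dynamic partition inserts need these two settings enabled.
    settings = (
        "set hive.exec.dynamic.partition=true; "
        "set hive.exec.dynamic.partition.mode=nonstrict; "
    )
    statement = (
        f"insert overwrite table {db}.{table} partition({partition_column}) "
        f"select * from {db}.{table};"
    )
    return settings + statement


def run_compaction(db, table, partition_column):
    """Run the statement through the Hive CLI; raises if hive exits non-zero."""
    sql = build_compaction_sql(db, table, partition_column)
    subprocess.run(["hive", "-e", sql], check=True)


# Call run_compaction("default", "t1", "partition_column") once the Spark job
# has finished. Here we only print the statement that would be sent to Hive.
print(build_compaction_sql("default", "t1", "partition_column"))
```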
In addition:
The ORC format offers alter table concatenate, which merges small ORC files into a new larger file.
alter table <db_name>.<orc_table_name> [partition (partition_column="val")] concatenate;
We can also use distribute by and sort by clauses to control the number of files; refer to this and this link for more details.
Another approach (Approach 3) is to use hadoop fs -getmerge to merge all the small files into one (this method works for text files; I haven't tried it for ORC, Avro, etc. formats).
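Conceptually, getmerge concatenates the files under a source directory into one local file, which is presumably why it suits plain text. The sketch below mimics that on the local filesystem just to show the idea; `merge_text_files` is a hypothetical helper, not part of Hadoop, and the part file names are made up for the demo.

```python
import os
import shutil
import tempfile


def merge_text_files(src_dir, dst_file):
    """Concatenate every regular file in src_dir, in sorted name order, into dst_file."""
    with open(dst_file, "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if os.path.isfile(path):
                with open(path, "rb") as part:
                    shutil.copyfileobj(part, out)


# Small demo: two part files in a scratch directory.
work_dir = tempfile.mkdtemp()
parts_dir = os.path.join(work_dir, "parts")
os.mkdir(parts_dir)
for name, text in [("part-00000", "a,1\n"), ("part-00001", "b,2\n")]:
    with open(os.path.join(parts_dir, name), "w") as f:
        f.write(text)

# The destination lives outside parts_dir so it is not merged into itself.
merged_path = os.path.join(work_dir, "merged.txt")
merge_text_files(parts_dir, merged_path)
with open(merged_path) as f:
    merged = f.read()
```

On the cluster the equivalent is hadoop fs -getmerge <src_dir> <local_file>, after which the merged file has to be put back into HDFS.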