
I have data to insert into Hive each week. For example, in the 1st week I insert:

spark.range(0,2).write.mode("append").saveAsTable("batches")

and in the 2nd week I insert:

spark.range(2,4).write.mode("append").saveAsTable("batches")

I am worried that if some exception occurs after the record with id 2 is inserted but before 3 is inserted, then when I insert the 2nd week's data again there will be two records with id 2.

I googled and found that Hive is not suited to deleting a particular set of records (link), so I cannot delete the data that was written before the exception in the 2nd week.

I think I could use Hive transactions, but Spark (version 2.3) is currently not fully compliant with Hive transactional tables; it seems we cannot even read from Hive if we enable Hive transactions.

And I saw on another website: "Hive's ACID feature (which introduces transactions) is not required for inserts, only updates and deletes. Inserts should be supported on a vanilla Hive shell." But I do not understand why a transaction is not needed for inserts, or what that means.

So, if I do not want to get duplicated data, or in other words want the inserts to run at most once, what shall I do?

You might say there won't be an exception between 2 and 3, but consider another case where I have many tables to write:

   spark.range(2,4).write.mode("append").saveAsTable("batches")
   val x = 1/0   // deliberately throws an exception between the two writes
   spark.range(2,4).write.mode("append").saveAsTable("batches2")

I tested it: the new records 2 and 3 were inserted into table "batches" but not into "batches2". So if I want to insert again, must I insert only into "batches2"? But how can I know where the exception occurred, and which tables need to be inserted again? I would have to add many try/catch blocks, which makes the code hard to read and write (a rough idea of what that bookkeeping might look like is sketched below). And what about exceptions like the disk being full or the power going off?
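
Here is a rough sketch, not from the question, of what that per-table try/catch bookkeeping might look like in a spark-shell (the `writeBatch` helper and the table list are made up for illustration):

    import scala.util.{Try, Failure}

    // Hypothetical helper: wrap one table's write in Try so a failure
    // in one write does not stop the others.
    def writeBatch(table: String): Try[Unit] =
      Try(spark.range(2, 4).write.mode("append").saveAsTable(table))

    val tables = Seq("batches", "batches2")
    val results = tables.map(t => t -> writeBatch(t))

    // Only the tables whose write failed would need to be written again.
    results.foreach {
      case (t, Failure(e)) => println(s"write to $t failed: ${e.getMessage}")
      case _               => ()
    }

This only catches in-process exceptions; as noted above, it does not help with a full disk or a power failure in the middle of a write.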

How to prevent duplicated data?

  • Before writing the current batch you can load the previous batch, do the `except` operation, and write the result to the table (see the sketch after these comments). I assume the batch data you are using is not too huge, e.g. `final df = curr_df.except(prev_df)` – Sathiyan S Dec 04 '18 at 10:27
  • It may not be huge, but I wonder what will happen if the data is huge? And in real work I would use `final df = curr_df.drop("id").except(prev_df)`, because id increases by 1: the first id of the current batch is one plus the max id of the previous batch. Any better way? @SathiyanS – hometown Dec 05 '18 at 02:40
  • You are correct, this is how I achieved it in a project. You can try your way; if you find a better way, share it... @hometown :) – Sathiyan S Dec 05 '18 at 05:35
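
For what it's worth, a minimal sketch of the dedupe-before-write idea from these comments, assuming the simple single-column table from the question (the `prevDf`/`currDf` names are illustrative):

    // Load what is already in the table and the new batch to be written.
    val prevDf = spark.table("batches")
    val currDf = spark.range(2, 4).toDF()

    // Keep only the rows that are not already present, then append; re-running
    // this after a partial failure should not create duplicates.
    val toInsert = currDf.except(prevDf)
    toInsert.write.mode("append").saveAsTable("batches")

As the follow-up comment notes, when the id column is auto-incremented you would drop it before the comparison, and loading the previous data gets more expensive as the table grows.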
