
First of all, I want to apologize: I do not have the vocabulary to talk about Hive properly (for example, I'm not sure whether what goes into a row is called data), but I'm trying to be as correct as possible.

I want to know whether it's possible to find out which rows were newly added to a Hive table, without adding an extra column to it (where you would put the date or some other metadata).

The case is as follows: a very large amount of data is going to be processed, and the selected data ends up in another Hive table. If new data is added to the original tables, I want to process only that new data rather than re-run the whole process, because that seems wasteful (we're talking about several million entries).

I would normally add a new column with dates, or just metadata that tells me whether or not a row has already been "computed".

Edit: I have been given more info. It turns out there are actually two problems, imo.

First, new data may arrive, and it would be far better to insert just those new rows into the destination table.
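
For this first case, one approach that needs no extra column is an anti-join on a key: left-join the source to the destination and keep only the source rows that have no match. The sketch below is a minimal illustration, not a tested Hive job. It uses Python's built-in sqlite3 only as a runnable stand-in for Hive, and the table names `source_table` and `dest_table`, the `payload` column, and the unique key `id` are all hypothetical. In Hive the same statement would start with `INSERT INTO TABLE dest_table`.

```python
import sqlite3

# sqlite3 is only a stand-in so the sketch runs anywhere; it is not Hive.
# Table names, the payload column and the unique key `id` are hypothetical.
ANTI_JOIN_INSERT = """
INSERT INTO dest_table
SELECT s.id, s.payload
FROM source_table s
LEFT OUTER JOIN dest_table d ON s.id = d.id
WHERE d.id IS NULL
"""


def insert_new_rows(conn):
    """Copy into dest_table every source row whose id is not there yet."""
    conn.execute(ANTI_JOIN_INSERT)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER, payload TEXT)")
conn.execute("CREATE TABLE dest_table (id INTEGER, payload TEXT)")

# First load: both rows are new, so both are copied.
conn.executemany("INSERT INTO source_table VALUES (?, ?)", [(1, "a"), (2, "b")])
insert_new_rows(conn)

# Later load: only the row with id 3 is missing from dest_table.
conn.execute("INSERT INTO source_table VALUES (3, 'c')")
insert_new_rows(conn)

dest_ids = [row[0] for row in conn.execute("SELECT id FROM dest_table ORDER BY id")]
```

This only works if the rows have something that identifies them uniquely, and the join still reads both tables in full, so it saves the re-processing but not the scan.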

Second, data might be updated. I've been told that Hive does not allow updates in the normal sense, since, for example, insert overwrite would just rewrite the whole set. (It turns out we're on Hive 0.12.0; in 0.14 SOME functionality has been added, but updating is not a possibility.)

monkey intern
  • Is it an external table or an internal table? If it's an external table, like one using HBaseStorageHandler, it will maintain a timestamp of when the record was inserted. But if it's only a Hive table, I don't think it's possible. – Ram Ghadiyaram Jun 14 '16 at 15:39
  • This may help: http://stackoverflow.com/questions/37709411/hive-best-way-to-do-incremetal-updates-on-a-main-table/37744071#37744071 – leftjoin Jun 14 '16 at 19:56
  • Add etl_create_datetm and etl_update_datetm fields and set them during insert overwrite. – leftjoin Jun 15 '16 at 13:09
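
The timestamp-column suggestion in the last comment could be used as a watermark: remember when the previous run happened and select only rows stamped after that. The following is a rough sketch under stated assumptions. Only the column names `etl_create_datetm` and `etl_update_datetm` come from the comment. The table name `source_table`, the `id` and `payload` columns, the sample rows and the `last_run` value are hypothetical, and Python's built-in sqlite3 stands in for Hive so that the example runs.

```python
import sqlite3

# sqlite3 is a stand-in for Hive. Only the etl_* column names come from the
# comment above; the table, the other columns and the data are hypothetical.


def rows_changed_since(conn, watermark):
    """Return source rows created or updated strictly after `watermark`."""
    return conn.execute(
        "SELECT id, payload FROM source_table "
        "WHERE etl_update_datetm > ? ORDER BY id",
        (watermark,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_table ("
    "id INTEGER, payload TEXT, etl_create_datetm TEXT, etl_update_datetm TEXT)"
)
conn.executemany(
    "INSERT INTO source_table VALUES (?, ?, ?, ?)",
    [
        (1, "a", "2016-06-01 00:00:00", "2016-06-01 00:00:00"),   # untouched
        (2, "b2", "2016-06-01 00:00:00", "2016-06-15 09:00:00"),  # updated later
        (3, "c", "2016-06-15 10:00:00", "2016-06-15 10:00:00"),   # newly inserted
    ],
)

# Hypothetical time of the previous processing run. Timestamps are stored as
# ISO-style text, so plain string comparison orders them correctly.
last_run = "2016-06-14 00:00:00"
pending = rows_changed_since(conn, last_run)
```

Filtering on `etl_update_datetm` picks up both new and updated rows, provided the update timestamp is also set when a row is first created.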

0 Answers