
I am trying to build a Hive table and automate it through Oozie. Data in the table should be no older than the last 30 days.

An action in the workflow would run every day. It would first purge data older than 30 days, then insert data for today: a sliding window with a 30-day interval.

Can someone show an example how to achieve this?

  • The table is a data table on Hadoop, created with a Hive query. It pulls data from two other tables. – i5z Nov 25 '15 at 19:10

2 Answers


As already answered in How to delete and update a record in Hive (by user ashtonium), Hive 0.14 is available with ACID support. So, create an .hql script and use a simple DELETE with a WHERE condition (for the current date, use unix_timestamp()), followed by an INSERT. The INSERT should be used in a bulk fashion, not in an OLTP one.
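
A minimal sketch of such a script, with placeholder names throughout (my_table, source_a, source_b and the event_date column are assumptions, not anything from the question):

    -- purge_and_insert.hql (hypothetical name)
    -- purge: delete rows older than 30 days; needs an ACID (transactional) table
    DELETE FROM my_table
    WHERE event_date < date_sub(to_date(from_unixtime(unix_timestamp())), 30);

    -- load: bulk-insert today's rows, e.g. joining the two source tables
    INSERT INTO TABLE my_table
    SELECT a.id, a.event_date, b.some_value
    FROM source_a a
    JOIN source_b b ON a.id = b.id
    WHERE a.event_date = to_date(from_unixtime(unix_timestamp()));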

Bechyňák Petr

Hive stores data in HDFS files, and these files are immutable.

Well, actually, with recent Hadoop releases HDFS files can be appended to, or even truncated, but only through a low-level API. Hive has no generic way to modify data files for text/AVRO/Parquet/ORC/whatever format, so for practical purposes HDFS files are immutable for Hive.

One workaround is to use transactional ORC tables, which write a new delta file on each "transaction" -- requiring a background process for periodic compaction of the resulting mess (i.e. yet another step of rewriting small files into bigger files).
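
Such a transactional table has to be declared up front as bucketed ORC with the transactional property (a minimal sketch with illustrative names; the session also needs the DbTxnManager transaction manager):

    -- ACID v1 prerequisites: ORC storage, buckets, transactional=true
    CREATE TABLE my_table (id INT, event_date STRING)
    CLUSTERED BY (id) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');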

Another workaround would be an ad hoc batch rewrite of your table whenever you want to get rid of older data -- e.g. every 2 weeks, run a batch that removes data older than 30 days.
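
A minimal sketch of that batch rewrite, again with placeholder names (Hive materializes the SELECT into a staging directory before replacing the table's files, so reading from the table being overwritten is safe):

    -- rewrite the whole table, keeping only the last 30 days
    INSERT OVERWRITE TABLE my_table
    SELECT *
    FROM my_table
    WHERE event_date >= date_sub(to_date(from_unixtime(unix_timestamp())), 30);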

> simple design

  • make sure that you will have no INSERT or SELECT running until the purge is over
  • create a new partitioned table with the same structure plus a dummy partitioning column
  • copy all the data-to-keep to that dummy partition
  • run an EXCHANGE PARTITION command (see the sketch after this list)
  • drop the partitioned table
  • now the older data is gone, and you can resume INSERTs
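
The exchange itself is a metastore/HDFS move, so no rows are rewritten. A minimal sketch, assuming hypothetical tables t_new (the freshly loaded copy) and t_live (empty, with the same columns and the same dummy partition key, which Hive requires for the exchange):

    -- move the dummy partition holding the kept rows from t_new into t_live
    ALTER TABLE t_live EXCHANGE PARTITION (dummy = 'keep') WITH TABLE t_new;

    -- t_new is now empty and can be dropped
    DROP TABLE t_new;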

> alternative design, allows INSERTs while the purge is running (sketched after the list)

  • rebuild your table with a dummy partitioning key, and make sure that all INSERTs always go into "current" partition
  • at purge time, rename the "current" partition to "to_be_purged" (and make sure you run no SELECT until the purge is over, otherwise you may get duplicates)
  • copy all the data-to-keep from "to_be_purged" to "current"
  • drop partition "to_be_purged"
  • now the older data is gone
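
A minimal sketch of that flow, assuming a hypothetical table my_table partitioned by a dummy string key part, with all INSERTs going into part='current':

    -- 1. swap the live partition out of the way (metadata-only)
    ALTER TABLE my_table PARTITION (part = 'current')
    RENAME TO PARTITION (part = 'to_be_purged');

    -- 2. copy only the last 30 days back into a fresh "current" partition
    INSERT OVERWRITE TABLE my_table PARTITION (part = 'current')
    SELECT id, event_date    -- list the non-partition columns explicitly
    FROM my_table
    WHERE part = 'to_be_purged'
      AND event_date >= date_sub(to_date(from_unixtime(unix_timestamp())), 30);

    -- 3. discard everything older
    ALTER TABLE my_table DROP PARTITION (part = 'to_be_purged');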

But it would be so much simpler if your table was partitioned by month, in ISO format (i.e. YYYY-MM). In that case you could just get the list of partitions and drop all that have a key "older" than (current month - 1), with a plain bash script. Believe me, it's simple and rock-solid.
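
A minimal sketch of such a script, assuming a hypothetical table my_db.my_table partitioned by a string column month in YYYY-MM format (ISO keys compare correctly as plain strings), and GNU date:

    #!/bin/bash
    # keep the current and previous month, drop everything older
    CUTOFF=$(date -d "1 month ago" +%Y-%m)

    # SHOW PARTITIONS prints one "month=YYYY-MM" line per partition
    for PART in $(hive -S -e "SHOW PARTITIONS my_db.my_table"); do
        KEY=${PART#month=}
        if [[ "$KEY" < "$CUTOFF" ]]; then
            hive -S -e "ALTER TABLE my_db.my_table DROP PARTITION (month='$KEY')"
        fi
    done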

Samson Scharfrichter
  • I finally came up with a solution where the purge script runs before the insert. Purge drops the partitions more than 30 days old, then insert adds a partition for the current date. – i5z Dec 08 '15 at 23:05