
I am trying to build a Hive table and automate it through Oozie. Data in the table should be no older than the last 30 days.

An action in the workflow would run every day. It would first purge data older than 30 days, then insert data for today: a sliding window with a 30-day interval.

Can someone show an example how to achieve this?

  • The table is a data table on Hadoop, created with a Hive query. It pulls data from two other tables. – i5z Nov 25 '15 at 19:10

2 Answers


As already answered in How to delete and update a record in Hive (by user ashtonium), Hive 0.14 is available with ACID support. So, create an .hql script and use a simple DELETE with a WHERE condition (for the current date, use unix_timestamp()), followed by an INSERT. The INSERT should be used in a bulk fashion, not in an OLTP one.
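
A minimal sketch of such a script, with placeholder names throughout (my_table, source_a, source_b and the event_date column are assumptions, not anything from the question):

    -- purge_and_insert.hql (hypothetical name)
    -- purge: delete rows older than 30 days; needs an ACID (transactional) table
    DELETE FROM my_table
    WHERE event_date < date_sub(to_date(from_unixtime(unix_timestamp())), 30);

    -- load: bulk-insert today's rows, e.g. joining the two source tables
    INSERT INTO TABLE my_table
    SELECT a.id, a.event_date, b.some_value
    FROM source_a a
    JOIN source_b b ON a.id = b.id
    WHERE a.event_date = to_date(from_unixtime(unix_timestamp()));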

Bechyňák Petr

Hive stores data in HDFS files, and these files are immutable.

Well, actually, with recent Hadoop releases HDFS files can be appended to, or even truncated, but only through a low-level API. Hive has no generic way to modify data files for text/AVRO/Parquet/ORC/whatever format, so for practical purposes HDFS files are immutable for Hive.

One workaround is to use transactional ORC tables, which write a new delta file on each "transaction" -- requiring a background process for periodic compaction of the resulting mess (i.e. yet another step of rewriting small files into bigger files).
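
Such a transactional table has to be declared up front as bucketed ORC with the transactional property (a minimal sketch with illustrative names; the session also needs the DbTxnManager transaction manager):

    -- ACID v1 prerequisites: ORC storage, buckets, transactional=true
    CREATE TABLE my_table (id INT, event_date STRING)
    CLUSTERED BY (id) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');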

Another workaround would be an ad hoc batch rewrite of your table whenever you want to get rid of older data -- e.g. every 2 weeks, run a batch that removes data older than 30 days.
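
A minimal sketch of that batch rewrite, again with placeholder names (Hive materializes the SELECT into a staging directory before replacing the table's files, so reading from the table being overwritten is safe):

    -- rewrite the whole table, keeping only the last 30 days
    INSERT OVERWRITE TABLE my_table
    SELECT *
    FROM my_table
    WHERE event_date >= date_sub(to_date(from_unixtime(unix_timestamp())), 30);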

> simple design

  • make sure that you will have no INSERT or SELECT running until the purge is over
  • create a new partitioned table with the same structure plus a dummy partitioning column
  • copy all the data-to-keep to that dummy partition
  • run an EXCHANGE PARTITION command (see the sketch after this list)
  • drop the partitioned table
  • now the older data is gone, and you can resume INSERTs
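
The exchange itself is a metastore/HDFS move, so no rows are rewritten. A minimal sketch, assuming hypothetical tables t_new (the freshly loaded copy) and t_live (empty, with the same columns and the same dummy partition key, which Hive requires for the exchange):

    -- move the dummy partition holding the kept rows from t_new into t_live
    ALTER TABLE t_live EXCHANGE PARTITION (dummy = 'keep') WITH TABLE t_new;

    -- t_new is now empty and can be dropped
    DROP TABLE t_new;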

> alternative design, allows INSERTs while the purge is running (sketched after the list)

  • rebuild your table with a dummy partitioning key, and make sure that all INSERTs always go into "current" partition
  • at purge time, rename the "current" partition to "to_be_purged" (and make sure you run no SELECT until the purge is over, otherwise you may get duplicates)
  • copy all the data-to-keep from "to_be_purged" to "current"
  • drop partition "to_be_purged"
  • now the older data is gone
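
A minimal sketch of that flow, assuming a hypothetical table my_table partitioned by a dummy string key part, with all INSERTs going into part='current':

    -- 1. swap the live partition out of the way (metadata-only)
    ALTER TABLE my_table PARTITION (part = 'current')
    RENAME TO PARTITION (part = 'to_be_purged');

    -- 2. copy only the last 30 days back into a fresh "current" partition
    INSERT OVERWRITE TABLE my_table PARTITION (part = 'current')
    SELECT id, event_date    -- list the non-partition columns explicitly
    FROM my_table
    WHERE part = 'to_be_purged'
      AND event_date >= date_sub(to_date(from_unixtime(unix_timestamp())), 30);

    -- 3. discard everything older
    ALTER TABLE my_table DROP PARTITION (part = 'to_be_purged');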

But it would be so much simpler if your table was partitioned by month, in ISO format (i.e. YYYY-MM). In that case you could just get the list of partitions and drop all that have a key "older" than (current month - 1), with a plain bash script. Believe me, it's simple and rock-solid.
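
A minimal sketch of such a script, assuming a hypothetical table my_db.my_table partitioned by a string column month in YYYY-MM format (ISO keys compare correctly as plain strings), and GNU date:

    #!/bin/bash
    # keep the current and previous month, drop everything older
    CUTOFF=$(date -d "1 month ago" +%Y-%m)

    # SHOW PARTITIONS prints one "month=YYYY-MM" line per partition
    for PART in $(hive -S -e "SHOW PARTITIONS my_db.my_table"); do
        KEY=${PART#month=}
        if [[ "$KEY" < "$CUTOFF" ]]; then
            hive -S -e "ALTER TABLE my_db.my_table DROP PARTITION (month='$KEY')"
        fi
    done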

Samson Scharfrichter
  • I finally came up with a solution where the purge script runs before the insert. Purge drops the partitions more than 30 days old, then insert adds a partition for the current date. – i5z Dec 08 '15 at 23:05