Hive stores data in HDFS files, and these files are immutable.
Well, actually, recent Hadoop releases let HDFS files be appended to, or even truncated, but only through a low-level API -- and Hive has no generic way to modify data files for text/AVRO/Parquet/ORC/whatever format, so for practical purposes HDFS files are immutable for Hive.
One workaround is to use transactional ORC tables, where each "transaction" writes a new delta file -- requiring a background process for periodic compaction of the resulting mess (i.e. yet another step of rewriting small files into bigger files).
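For the record, it looks like this -- just a sketch, assuming a recent Hive with ACID enabled (DbTxnManager, concurrency support, etc.), and made-up table/column names:

```sql
-- Transactional table: must be ORC, bucketed, flagged as transactional
CREATE TABLE events_acid (id BIGINT, payload STRING, event_date DATE)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Each DELETE writes a new delta file; compaction merges them later
DELETE FROM events_acid
WHERE event_date < DATE_SUB(CURRENT_DATE, 30);

-- Compaction can also be requested manually instead of waiting
ALTER TABLE events_acid COMPACT 'major';
```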
Another workaround would be an ad hoc batch rewrite of your table whenever you want to get rid of older data -- e.g. every 2 weeks, run a batch that removes data older than 30 days.
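A sketch of that batch, with made-up names; as far as I know Hive materializes the SELECT result in a staging directory before the overwrite, so reading from the table you are overwriting is OK -- but nothing else must be running against the table meanwhile:

```sql
-- Rewrite the table in place, keeping only the last 30 days of data
INSERT OVERWRITE TABLE events
SELECT *
FROM events
WHERE event_date >= DATE_SUB(CURRENT_DATE, 30);
```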
> simple design (sketched below)
- make sure that you will have no INSERT or SELECT running until the purge is over
- create a new partitioned table with the same structure plus a dummy partitioning column
- copy all the data-to-keep to that dummy partition
- run an EXCHANGE PARTITION command
- drop the temporary partitioned table
- now the older data is gone, and you can resume INSERTs
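In HiveQL that could look like this -- a sketch only, assuming your table was itself created with the dummy partition key in the first place (just like in the alternative design below), since EXCHANGE PARTITION needs the same schema and partition keys on both sides; all names are made up:

```sql
-- Assumes events is partitioned by a dummy key pkey, all data in pkey='p'
CREATE TABLE events_tmp LIKE events;     -- same structure, same dummy key

-- copy only the rows to keep into the temp table's dummy partition
INSERT OVERWRITE TABLE events_tmp PARTITION (pkey='p')
SELECT id, payload, event_date           -- data columns only, not pkey
FROM events
WHERE event_date >= DATE_SUB(CURRENT_DATE, 30);

-- swap the partitions: drop the old one, move the trimmed one in
-- (EXCHANGE PARTITION is Hive 0.12+, and the destination partition
--  must not exist, hence the DROP first)
ALTER TABLE events DROP PARTITION (pkey='p');
ALTER TABLE events EXCHANGE PARTITION (pkey='p') WITH TABLE events_tmp;

DROP TABLE events_tmp;
```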
> alternative design, allows INSERTs while purge is running (sketched below)
- rebuild your table with a dummy partitioning key, and make sure that
all INSERTs always go into "current" partition
- at purge time, rename "current" partition as "to_be_purged" (and
make sure that you will run no SELECT until purge is over, otherwise
you may get duplicates)
- copy all the data-to-keep from "to_be_purged" to "current"
- drop partition "to_be_purged"
- now the older data is gone
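Sketched in HiveQL, with the same made-up names:

```sql
-- Assumes events was rebuilt with a dummy key pkey, and all INSERTs
-- always target pkey='current'
ALTER TABLE events PARTITION (pkey='current')
RENAME TO PARTITION (pkey='to_be_purged');   -- new INSERTs re-create 'current'

-- copy the rows to keep back into 'current' (run no SELECT until this
-- is done, or readers may see duplicates)
INSERT INTO TABLE events PARTITION (pkey='current')
SELECT id, payload, event_date               -- data columns only, not pkey
FROM events
WHERE pkey='to_be_purged'
  AND event_date >= DATE_SUB(CURRENT_DATE, 30);

-- drop the old partition and its files
ALTER TABLE events DROP PARTITION (pkey='to_be_purged');
```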
But it would be soooooooooooooooooo much simpler if your table were partitioned by month, in ISO format (i.e. YYYY-MM). Since ISO keys sort lexically, you could just get the list of partitions and drop every one with a key "older" than (current month -1), with a plain bash script and a plain string comparison. Believe me, it's simple and rock-solid.
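Something like this (again, names made up; the cutoff would be computed by the bash script and injected via `hive -e` or a hivevar):

```sql
-- Table partitioned by month in ISO format, e.g. month_key='2016-04'
CREATE TABLE events_by_month (id BIGINT, payload STRING, event_date DATE)
PARTITIONED BY (month_key STRING);

-- ISO keys sort lexically, so one string comparison drops every partition
-- older than the cutoff (or SHOW PARTITIONS and drop them one by one)
ALTER TABLE events_by_month DROP IF EXISTS PARTITION (month_key < '2016-03');
```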