Hive version: 3.1.0.3.1.4.0-315
Spark version: 2.3.2.3.1.4.0-315

Basically, I am trying to read transactional table data from Spark. According to this page [https://stackoverflow.com/questions/50254590/how-to-read-orc-transaction-hive-table-in-spark][1], a transactional table has to be compacted first, so I want to try this approach.

I am new to this and tried running compaction on the delta files, but the state always shows "initiated" and the compaction never completes. This happens for both major and minor compaction. Any help would be highly appreciated.

  1. Is this a good approach?
  2. How can I monitor the progress of a compaction job other than with SHOW COMPACTIONS? The only related line I can see in hiveserver_stdout.log is "Compaction enqueued with id 1".
  3. Generally, how long does a compaction take to complete?
  4. Is there any way to stop a compaction?

TIA.

[Edited]

SHOW COMPACTIONS;

+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
| compactionid  |  dbname   |    tabname     |    partname    |  type  |   state    | workerid  |  starttime  |   duration    | hadoopjobid  |
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
| CompactionId  | Database  | Table          | Partition      | Type   | State      | Worker    | Start Time  | Duration(ms)  | HadoopJobId  |
| 1             | tmp       | shop_na2       | dt=2014-00-00  | MAJOR  | initiated  |  ---      |  ---        |  ---          |  ---         |
| 2             | tmp       | na2_check      | dt=2014-00-00  | MINOR  | initiated  |  ---      |  ---        |  ---          |  ---         |
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
3 rows selected (0.408 seconds)

SHOW COMPACTIONS has returned this same result for the past 36 hours, even though the retention period is set to 86400 seconds.

natarajan k

1 Answer

It is advisable to run this operation when the load on the cluster is low, for example over a weekend when fewer jobs are running. Compaction is a resource-intensive operation, and the time it takes depends on the amount of data, but a moderate number of delta files can take multiple hours. You can use the query SHOW COMPACTIONS; to check the status of a compaction, which includes the following details:

  • Database name
  • Table name
  • Partition name
  • Major or minor compaction
  • Compaction state:
      • Initiated - waiting in queue
      • Working - currently compacting
      • Ready for cleaning - compaction completed and old files scheduled for removal
  • Thread ID
  • Start time of compaction
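
If you capture the SHOW COMPACTIONS output as text, a small script can tally the states and list the requests that are still waiting in the queue. The following is only a rough sketch: the function names (parse_compactions, summarize, still_queued) are made up for illustration, and it assumes the beeline-style table layout shown in the question, where a worker that has not picked up the job appears as "---".

```python
from collections import Counter

# Sample text copied from the SHOW COMPACTIONS output in the question.
SAMPLE = """
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
| compactionid  |  dbname   |    tabname     |    partname    |  type  |   state    | workerid  |  starttime  |   duration    | hadoopjobid  |
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
| CompactionId  | Database  | Table          | Partition      | Type   | State      | Worker    | Start Time  | Duration(ms)  | HadoopJobId  |
| 1             | tmp       | shop_na2       | dt=2014-00-00  | MAJOR  | initiated  |  ---      |  ---        |  ---          |  ---         |
| 2             | tmp       | na2_check      | dt=2014-00-00  | MINOR  | initiated  |  ---      |  ---        |  ---          |  ---         |
+---------------+-----------+----------------+----------------+--------+------------+-----------+-------------+---------------+--------------+
3 rows selected (0.408 seconds)
"""


def parse_compactions(text):
    """Parse beeline-style SHOW COMPACTIONS output into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +---+ borders and the "N rows selected" footer
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first |...| line holds the column names
            continue
        if cells[0] == "CompactionId":
            continue  # the output repeats a human-readable header as a row
        rows.append(dict(zip(header, cells)))
    return rows


def summarize(rows):
    """Count compaction requests per state (initiated, working, ...)."""
    return Counter(row["state"] for row in rows)


def still_queued(rows):
    """Return requests that are 'initiated' and have no worker assigned."""
    return [r for r in rows if r["state"] == "initiated" and r["workerid"] == "---"]


rows = parse_compactions(SAMPLE)
print(summarize(rows))
for r in still_queued(rows):
    print(r["compactionid"], r["dbname"], r["tabname"], r["type"])
```

Running this against output captured at intervals shows whether a request ever leaves the "initiated" state or gets a worker assigned.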