2

I am trying to merge multiple small ORC files. Came across ALTER TABLE CONCATENATE command but that only works for managed tables.

Hive gave me the following error when I try to run it :

FAILED: SemanticException org.apache.hadoop.hive.ql.parse.SemanticException: Concatenate/Merge can only be performed on managed tables

Following are the table parameters :

Table Type:             EXTERNAL_TABLE
Table Parameters:
    COLUMN_STATS_ACCURATE   true
    EXTERNAL                TRUE
    numFiles                535
    numRows                 27051810
    orc.compress            SNAPPY
    rawDataSize             20192634094
    totalSize               304928695
    transient_lastDdlTime   1512126635

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
    serialization.format    1
Rahul Sharma
  • 5,614
  • 10
  • 57
  • 91
vatsal
  • 109
  • 2
  • 12
  • You need to merge into a separate table – OneCricketeer Dec 14 '17 at 02:15
  • As [reported by **@leftjoin**](https://stackoverflow.com/questions/50042225/how-do-i-combine-or-merge-small-orc-files-into-larger-orc-file#comment87146112_50042837), I can confirm that `ALTER TABLE .. PARTITION .. CONCATENATE` worked for my `Hive` `EXTERNAL` table (data stored on `S3`). I am using `Hive 2.3.4` on `emr-5.20.0` [I insert data via `Spark`] – y2k-shubham Feb 21 '19 at 12:26

2 Answers2

3

I believe your table is an external table,then there are two ways:

  1. Either you can change it to Managed table (ALTER TABLE <table> SET TBLPROPERTIES('EXTERNAL'='FALSE') and run the ALTER TABLE CONCATENATE.Then you can convert the same back to external changing it to TRUE.
  2. Or you can create a managed table using CTAS and insert the data. Then run the merge query and import the data back to external table
Subash
  • 887
  • 1
  • 8
  • 19
  • Any reason why we should set external=false before running ALTER CONCATENATE? – Pruthvi Chitrala Apr 25 '19 at 10:10
  • The whole query is related to that.As the user mentioned,he is facing error because its an external table for executing the Alter command. So you need to change it to Managed and then run alter command. – Subash Apr 25 '19 at 10:19
  • Thanks for quick response. I also have similar use case to merge ORC files on external table. But, I was able run(without error) alter concatenate query without setting external=false. But concatenation of small files is not behaving as expected with external table. So, wanted to know the reason behind setting table as external=false or managed table. – Pruthvi Chitrala Apr 25 '19 at 10:29
0

From my previous answer to this question, here is a small script in Python using PyORC to concatenate the small ORC files together. It doesn't use Hive at all, so you can only use it if you have direct access to the files and are able to run a Python script on them, which might not always be the case in managed hosts.

import pyorc
import argparse


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output', type=argparse.FileType(mode='wb'))
    parser.add_argument('files', type=argparse.FileType(mode='rb'), nargs='+')
    args = parser.parse_args()

    schema = str(pyorc.Reader(args.files[0]).schema)

    with pyorc.Writer(args.output, schema) as writer:
        for i, f in enumerate(args.files):
            reader = pyorc.Reader(f)
            if str(reader.schema) != schema:
                raise RuntimeError(
                    "Inconsistent ORC schemas.\n"
                    "\tFirst file schema: {}\n"
                    "\tFile #{} schema: {}"
                    .format(schema, i, str(reader.schema))
                )
            for line in reader:
                writer.write(line)


if __name__ == '__main__':
    main()
Antoine Pietri
  • 793
  • 1
  • 10
  • 25