3

This is my first week with Hive and HDFS, so please bear with me.

Almost all the ways I saw so far to merge multiple ORC files suggest using ALTER TABLE with CONCATENATE command.

But I need to merge multiple ORC files of the same table without having to ALTER the table. Another option is to create a copy of the existing table and then use ALTER TABLE on that so that my original table remains unchanged. But I can't do that as well because space and data redundancy reasons.

The thing I'm trying to achieve (ideally) is: I need to transport these ORCs as one file per table into a cloud environment. So, is there a way that I can merge the ORCs on-the-go during the transfer process into cloud? Can this be achieved with/without Hive, maybe directly in HDFS?

LearneR
  • 2,351
  • 3
  • 26
  • 50
  • Possible duplicate of [Create Table in Hive with one file](https://stackoverflow.com/questions/45265339/create-table-in-hive-with-one-file) – leftjoin Jun 16 '19 at 15:17
  • When writing an answer I noticed that all solutions are already described in another answer. https://stackoverflow.com/q/45265339/2700344 – leftjoin Jun 16 '19 at 15:19

1 Answers1

1

Two possible methods other than ALTER TABLE CONCATENATE:

  1. Try to configure merge task, see details here: https://stackoverflow.com/a/45266244/2700344

  2. Alternatively you can force single reducer. This method is quite applicable for not too big files. You can overwrite the same table with ORDER BY, this will force single reducer on the last ORDER BY stage. This will work slow or even fail with big files because all the data will be passed through single reducer:

    INSERT OVERWRITE TABLE
    SELECT * FROM TABLE
      ORDER BY some_col; --this will force single reducer

As a side effect you will get better packed ORC file with efficient index on columns listed in order by.

leftjoin
  • 36,950
  • 8
  • 57
  • 116