I'm using an INSERT INTO command to load data from a text table into an RC table. The destination (RC) table is partitioned, so dynamic partitioning is enabled. After the INSERT INTO completes, each partition contains multiple small files. I tried setting a few Hive merge parameters, but the result is more or less the same.

The only thing that worked for me was adding ORDER BY [any column] to the INSERT INTO command. That adds a reduce stage, which ends up producing a single file in each partition.

This feels like an ugly workaround; I'm looking for a more elegant way.

Any suggestions?

Thanks

Acorn
sharon
  • Which parameters have you tried? Check if this helps https://stackoverflow.com/a/59496778/12602787 – damientseng Sep 20 '20 at 11:48
  • Thank you, I tried most of the parameters suggested in the link you shared. Any idea which values need to be configured? – sharon Sep 20 '20 at 15:10

1 Answer

Try adding DISTRIBUTE BY <partition key list> instead of ORDER BY. It sends all rows with the same partition key to the same reducer, so each partition is written by a single reducer instead of every reducer writing its own file into every partition. It should also run faster than ORDER BY, which forces a global sort.
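A minimal sketch of what that could look like. The table and column names (src_txt, dest_rc, col1, col2, part_col) are hypothetical placeholders, not taken from the question, and the two SET lines assume dynamic partitioning is not already configured for the session:

```sql
-- Hypothetical names: src_txt (text source), dest_rc (partitioned RC table),
-- col1/col2 (data columns), part_col (partition key).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE dest_rc PARTITION (part_col)
SELECT col1, col2, part_col   -- partition column goes last in the SELECT list
FROM src_txt
DISTRIBUTE BY part_col;       -- rows with the same part_col go to one reducer
```

If a single partition is very large, all of its rows still land on one reducer, so this may trade small files for a slow reducer on skewed data.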

leftjoin