0

Currently testing a cluster and when using the "CREATE TABLE AS" the resulting managed table ends up being one file ~ 1.2 GB while the base file the query is created from has many small files. The SELECT portion runs fast, but then the result is 2 reducers running to create one file which takes 75% of the run time.

Additional testing:

1) If using "CREATE EXTERNAL TABLE AS" is used the query runs very fast and there is no merge files step involved.

2) Also, the merging doesn't appear to occur with version HDP 3.0.1.

vikrant rana
  • 4,509
  • 6
  • 32
  • 72
rbigley
  • 103
  • 1
  • 7
  • 1
    Does this answer your question? [Specify minimum number of generated files from Hive insert](https://stackoverflow.com/questions/55372028/specify-minimum-number-of-generated-files-from-hive-insert) – leftjoin Nov 09 '19 at 05:41
  • 1
    Also read this: https://stackoverflow.com/a/38475807/2700344 Also read this about using rand() in the distribute by: https://stackoverflow.com/a/58509772/2700344 – leftjoin Nov 09 '19 at 05:48

1 Answers1

0

You can change set hive.exec.reducers.bytes.per.reducer=<number> to let hive decide number of reducers based on reducer input size (default value is set to 1 GB or 1000000000 bytes ) [ you can refer to links provided by @leftjoin to get more details about this property and fine tuning for your needs ]

Another option you can try is to change following properties

set mapreduce.job.reduces=<number>
set hive.exec.reducers.max=<number>

Naga
  • 416
  • 3
  • 11