
I ran a Hive job that generates a word-frequency ranking from sentences, and I would like it to write a single output file instead of multiple files.

I searched this site for similar questions and found `mapred.reduce.tasks=1`, but it didn't produce one file; it still produced 50 files.

The process I tried has 50 input files, all of which are gzip files.

How do I get one merged file? The 50 input files are quite large, so I suspect some kind of size limit may be the reason.

  • `mapred.reduce.tasks=1` probably didn't work because your query has no reduce stage. You can induce a reduce stage by adding, for example, `sort by` to your Hive query – serge_k Aug 27 '18 at 14:17
  • What is the reason for wanting one file? Because your program is fully distributed, each container creates its own file independently. A single reducer will kill parallelism, and you can later read the files in parallel anyway; a Hive table can also read many files in its location. You can concatenate them using the cat command; better not to use zip, to make this easier – leftjoin Aug 27 '18 at 15:43
  • Check Hive properties under "merge small files" _(where "small" is configurable)_ https://stackoverflow.com/questions/47272492/why-does-a-map-only-job-in-hive-results-in-a-single-output-file – Samson Scharfrichter Aug 27 '18 at 18:14
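Following the last comment, the "merge small files" behavior is controlled by Hive settings along these lines. This is a sketch with illustrative values, not a verified fix for this job; check the defaults and exact semantics for your Hive version:

```sql
-- Ask Hive to launch an extra merge job when output files are small.
SET hive.merge.mapfiles=true;        -- merge outputs of map-only jobs
SET hive.merge.mapredfiles=true;     -- merge outputs of map-reduce jobs
SET hive.merge.smallfiles.avgsize=16000000;  -- merge when avg output file size is below this (bytes)
SET hive.merge.size.per.task=256000000;      -- target size of merged files (bytes)
```

With large outputs this yields fewer files rather than exactly one, but it avoids funneling everything through a single reducer.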

2 Answers


In your job, use an `order by` clause on some field.

Hive will then enforce a single reducer, so you end up with one file created in HDFS.

hive> Insert into default.target 
         Select * from default.source
      order by id;

For more details regarding the `order by` clause, refer to this and this.

notNull

Thank you for your kind answers, you are really saving me. I am trying `order by`, but it is taking a long time, so I am waiting for it to finish. All I need is a single file to use as input for the next step, so I am also going to try simply catting all the reducer output files, as advised. If I do that, I am worried about two things: whether each word is unique to one file, with no word shared between files, and whether the result of catting multiple gzip files is a normal gzip file.
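On the gzip worry: the gzip format (RFC 1952) allows multiple members in one file, so byte-for-byte concatenation of gzip files yields a valid gzip stream that standard tools decompress in full. A quick local check, using hypothetical part-file names:

```shell
# Simulate two reducer output files (hypothetical names) with local gzip data.
printf 'apple\t3\n'  | gzip > part-00000.gz
printf 'banana\t2\n' | gzip > part-00001.gz

# Plain byte concatenation: gzip permits multiple members in one file,
# so the result is itself a valid gzip file.
cat part-00000.gz part-00001.gz > merged.gz

# Standard tools decompress every member in sequence.
zcat merged.gz   # prints both records

# On a real cluster, `hadoop fs -getmerge <hdfs-dir> merged.gz` performs
# the same concatenation from HDFS down to the local filesystem.
```

Note this only answers the file-format question; whether a word can appear in more than one part file depends on whether the job actually partitioned words across reducers.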