
I am using Hive.

I have 24 JSON files with a total size of 300 MB (in one folder), so I created one external table (table1) and loaded the data (the 24 files) into it.
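For reference, the setup was roughly as follows (a sketch only: the schema and location are placeholders, and the real table maps the actual JSON fields):

    CREATE EXTERNAL TABLE table1 (json_line STRING)
    LOCATION '/user/hive/data/json_files';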

When I run a select query on that external table (table1), I see 3 mappers and 1 reducer running.

After that, I created one more external table (table2).

Then I compressed my input files (the folder containing the 24 files).

Example: BZIP2.

The compression worked, but 24 separate files were still created, now with a BZIP2 extension (file1.bz2, ..., file24.bz2).
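The compression step was along these lines (a sketch; the folder path is a placeholder):

    # compress each file in place; file1.json becomes file1.json.bz2, and so on
    bzip2 /data/json_files/*.json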

After that, I loaded the compressed files into my external table (table2).

Now, when I run a select query, it takes 24 mappers and 1 reducer, and I observed that the CPU time is higher compared to the uncompressed files.

How can I reduce the number of mappers when the data is in compressed format (the table2 select query)?

How can I reduce the CPU time when the data is in compressed format (the table2 select query)? And how does CPU time affect performance?

Sai

3 Answers


The number of mappers can be less than the number of files only if several files are on the same data node; if the files are located on different data nodes, the number of mappers will never be less than the number of files. You got 24 mappers because you have 24 files. To reduce the count, concatenate all or some of the files and put the result into your table location; use the cat command for concatenating non-compressed files. The parameters mapreduce.input.fileinputformat.split.minsize / maxsize are for splitting bigger files, not for combining small ones.
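A minimal shell sketch of the concatenation approach (paths and names are examples, and it assumes the raw files are available on the local filesystem):

    # concatenate the uncompressed JSON files into one bigger file
    cat /staging/input/*.json > /staging/merged/all.json
    # replace the small files under the table location with the merged file
    hdfs dfs -put -f /staging/merged/all.json /user/hive/warehouse/table1/

With one big splittable file in the table location, the split-size parameters can then control how many mappers read it.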

leftjoin
  • More mappers in parallel means more performance. But if there are too many mappers, say thousands or more, some of them will not be executing (pending); they will wait in the queue for free slots. That is why performance may degrade. – leftjoin Jul 20 '16 at 09:29
  • 24 mappers is not too many for big data. It depends on your cluster/database size; for some clusters 24K or more mappers is OK. – leftjoin Jul 20 '16 at 09:42
  • Hi, thanks for your reply. I have created a table partitioned by year, month, and day. I receive data hourly, so for 24 hours there are 24 files (total size 1 GB); I compress them and load them into the Hive external table. As per your point above, when I query yesterday's date we have 24 zipped files, so 24 mappers execute. In future, if I run a select query over one month, the total input will be 24*30 = 720 files, so 720 mappers will execute. – Sai Jul 20 '16 at 10:22
  • So my doubt is that this may create a performance problem in future. One more doubt: with the same data uncompressed, the query takes only 6 mappers, and I am not sure how that happens. Please let me know how I can handle this situation, i.e. I want to compress my data and at the same time reduce my number of mappers. – Sai Jul 20 '16 at 10:22
  • Possibly some of your uncompressed files are on the same nodes and mappers are re-used for reading a few files each. Compressed text files are not splittable and require a separate mapper for each file. – leftjoin Jul 20 '16 at 10:34
  • I suggest you change the file format to one that Hive can merge automatically: SequenceFile or ORC. Then use merge during insert; all the files inside the partition folder will be merged. – leftjoin Jul 20 '16 at 10:35
  • Sorry, I forgot that you are not inserting, you are copying files, right? Then you can insert overwrite the target table partition from some stage table and the files will be merged: target = ORC, stage = your text files. – leftjoin Jul 20 '16 at 10:38
  • To switch on merge you can use these settings: set hive.merge.mapfiles=true; set hive.merge.mapredfiles=true; set hive.merge.size.per.task=500000000; set hive.merge.smallfiles.avgsize=500000000; (when the average output file size of a job is less than this number, Hive starts an additional map-reduce job to merge the output files into bigger files) and set hive.merge.tezfiles=true; if you are using Tez. Alternatively, if the target table is ORC, you can merge files using ALTER TABLE T [PARTITION partition_spec] CONCATENATE; – leftjoin Jul 20 '16 at 10:40
  • Hi, as per my understanding of your comment, I need to use an intermediate table for merging and compression, so that I can reduce the number of mappers. Am I right? One more doubt: would it also be fine to use "cat *.txt >> newfile.txt", then compress with "bzip2 filename", and load that into the table? – Sai Jul 20 '16 at 11:57
  • 1. Just put your files into the intermediate table location; no need to cat and compress, because you can delete them after a successful load into the target (or before the next load). 2. set hive.merge.mapfiles=true; set hive.merge.mapredfiles=true; set hive.merge.size.per.task=500000000; set hive.merge.smallfiles.avgsize=500000000; then insert overwrite the table partition with a select from stg_table, and Hive will merge it for you (see the consolidated sketch after this thread). – leftjoin Jul 20 '16 at 12:01
  • And don't forget compression settings like these before the insert overwrite: set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec; set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec; set hive.exec.compress.intermediate=true; set mapred.output.compress=true; set mapred.compress.map.output=true; set hive.exec.compress.output=true; – leftjoin Jul 20 '16 at 12:05
  • Sorry for asking too many questions, but here I want to compress the data, yet we are not performing any explicit compression step. So how will the compression work? – Sai Jul 20 '16 at 12:10
  • Compression is done during the insert overwrite. See my previous comment with the session-level compression parameters. If you choose the ORC format, you can create the table with compression properties at the table level and not bother with session parameters: STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY") – leftjoin Jul 20 '16 at 12:22
  • SnappyCodec is faster than Gzip; you can use Gzip for a better compression ratio. – leftjoin Jul 20 '16 at 12:26
  • Hmm, but Snappy is not splittable, so I am using Bzip2 (splittable). – Sai Jul 20 '16 at 12:40
  • ORC is splittable itself. Snappy is intended to be used with a container format, like Sequence Files or Avro data files, rather than directly on plain text, since the latter is not splittable and can't be processed in parallel using MapReduce. See this: http://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable – leftjoin Jul 20 '16 at 13:08
  • As per my understanding of your suggestion, I created an external table "x" and copied 3 files (1 KB each) into its location. When I executed "Select count(*) from X" it took 1 mapper. After that I created one more external table "y" and set the following properties: "set hive.merge.mapfiles=true; set hive.merge.mapredfiles=true; set hive.merge.size.per.task=500000000; set hive.merge.smallfiles.avgsize=500000000; set hive.merge.tezfiles=true; set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec; – Sai Jul 21 '16 at 11:05
  • set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec; set hive.exec.compress.intermediate=true; set mapred.output.compress=true; set mapred.compress.map.output=true; set hive.exec.compress.output=true;" Then I executed "INSERT OVERWRITE TABLE Y IF NOT EXISTS select * from X", and afterwards "Select count(*) from Y" took 2 mappers, even though the table location for "Y" contains only one BZIP2 file (195 bytes). My doubt is: why does it take 2 mappers? – Sai Jul 21 '16 at 11:06
  • I do not know. Look at the mapper process logs and check what it is doing. 195 bytes is less than one block. – leftjoin Jul 21 '16 at 12:25
  • OK, I will check. Thanks for your response. Are the steps I followed as per your suggestion right, or do I have to follow some other steps? – Sai Jul 21 '16 at 13:01
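Pulling the thread together, the stage-table approach leftjoin describes would look roughly like this end to end (a sketch only: table names, schema, partition values, and sizes are illustrative, not taken from the question):

    -- stage table over the raw text files
    CREATE EXTERNAL TABLE stg_events (json_line STRING)
    LOCATION '/staging/events';

    -- target table: ORC with table-level compression, so no session codec settings needed
    CREATE TABLE events_orc (json_line STRING)
    PARTITIONED BY (dt STRING)
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="SNAPPY");

    -- merge small output files during the insert
    set hive.merge.mapfiles=true;
    set hive.merge.mapredfiles=true;
    set hive.merge.size.per.task=500000000;
    set hive.merge.smallfiles.avgsize=500000000;

    INSERT OVERWRITE TABLE events_orc PARTITION (dt='2016-07-20')
    SELECT json_line FROM stg_events;

    -- alternatively, merge the small files of an existing ORC partition in place
    ALTER TABLE events_orc PARTITION (dt='2016-07-20') CONCATENATE;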

If the file size is 200000 bytes, setting the values

set mapreduce.input.fileinputformat.split.maxsize=100000;
set mapreduce.input.fileinputformat.split.minsize=100000;

will trigger 200000/100000 = 2 mappers for the MapReduce job,

while setting the value of

set mapreduce.input.fileinputformat.split.maxsize=50000;
set mapreduce.input.fileinputformat.split.minsize=50000;

will trigger 200000/50000 = 4 mappers for the same job.
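For example, a full session against a single 200000-byte splittable file might look like this (the table name is illustrative; note that with a non-splittable codec each file still gets its own mapper regardless of these settings):

    set mapreduce.input.fileinputformat.split.maxsize=100000;
    set mapreduce.input.fileinputformat.split.minsize=100000;
    -- roughly 200000 / 100000 = 2 map tasks expected
    select count(*) from table1;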

Read:
  • splittable-gzip
  • set-mappers-in-pig-hive-and-mapreduce
  • how-to-control-the-number-of-mappers-required-for-a-hive-query

Ronak Patel
  • Hi, thanks for your reply. As per your suggestion I set "set mapreduce.input.fileinputformat.split.minsize=200000000; set mapreduce.input.fileinputformat.split.maxsize=500000000;" and executed the select query, but it still takes 24 mappers. I have 24 small input files with a total size of 1 GB. – Sai Jul 20 '16 at 07:52
  • Yes, as per leftjoin's response, it will run 1 mapper for each input file. I misunderstood your question. – Ronak Patel Jul 20 '16 at 12:40
  • It seems your answer applies to MR; Tez works differently: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works – leftjoin Jul 21 '16 at 12:57

In order to manually set the number of mappers for a Hive query when Tez is the execution engine, the configuration tez.grouping.split-count can be used by either:

  • Setting it in the Hive CLI: set tez.grouping.split-count=4 will create 4 mappers.
  • Adding an entry to hive-site.xml, for example via Ambari. If set via hive-site.xml, Hive will need to be restarted.
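If going the hive-site.xml route, the entry would look roughly like this (the value 4 is only an example):

    <property>
      <name>tez.grouping.split-count</name>
      <value>4</value>
    </property>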