
I am running Hive on MapReduce, and some of the mappers run for too long (~8 hrs) — mostly the last few mappers. In the logs I see a lot of [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 59 and org.apache.hadoop.mapred.MapTask: Spilling map output messages. Can you help me tune this?

Below is a sample of the query I am running:

CREATE TABLE schema.test_t AS
SELECT
demo,
col1,
col2,
col3,
col4,
col5,
col6,
col7,
SUM(col8) AS col8,
COUNT(1) AS col9,
COUNT(DISTINCT col10) AS col10,
col11,
col12
FROM
schema.srce_t
WHERE col13 IN ('a','b')
GROUP BY
col1,col2,col3,col4,col5,col6,col7,col11,col12
GROUPING SETS ((col1,col2,col3,col4,col5,col6,col7,col11,col12),
(col1,col11,col2,col3,col5,col6,col12,col7),
(col1,col11,col2,col3,col6,col12,col7),
(col1,col11,col2,col3,col4,col6,col12,col7),
(col1,col11,col2,col4,col5,col6,col12,col7),
(col1,col11,col2,col4,col6,col12,col7),
(col1,col11,col2,col5,col6,col12,col7),
(col1,col11,col4,col5,col6,col12,col7),
(col1,col11,col3,col4,col5,col6,col12,col7),
(col1,col11,col3,col5,col6,col12,col7),
(col1,col11,col3,col4,col6,col12,col7),
(col1,col11,col4,col6,col12,col7),
(col1,col11,col3,col6,col12,col7),
(col1,col11,col5,col6,col12,col7),
(col1,col11,col2,col6,col12,col7),
(col1,col11,col6,col12,col7));
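
One likely contributor to the long map tasks is COUNT(DISTINCT col10) combined with GROUPING SETS, which limits map-side aggregation. A possible two-stage rewrite (a sketch only, assuming the grouping columns never contain NULLs in the data, so rows from different grouping sets cannot collide in the outer GROUP BY) pushes col10 into each grouping set, so the distinct count becomes a plain COUNT over pre-deduplicated rows:

```sql
-- Sketch: stage 1 deduplicates col10 inside each grouping set;
-- stage 2 counts the deduplicated rows instead of using COUNT(DISTINCT).
CREATE TABLE schema.test_t AS
SELECT
    col1, col2, col3, col4, col5, col6, col7, col11, col12,
    SUM(col8) AS col8,
    SUM(cnt)  AS col9,
    COUNT(1)  AS col10            -- one inner row per distinct col10 value
FROM (
    SELECT
        col1, col2, col3, col4, col5, col6, col7, col11, col12, col10,
        SUM(col8) AS col8,
        COUNT(1)  AS cnt
    FROM schema.srce_t
    WHERE col13 IN ('a','b')
    GROUP BY col1, col2, col3, col4, col5, col6, col7, col11, col12, col10
    GROUPING SETS ((col1,col2,col3,col4,col5,col6,col7,col11,col12,col10),
                   (col1,col11,col2,col3,col5,col6,col12,col7,col10))
                   -- ...and each remaining set from the query, with col10 appended
) t
GROUP BY col1, col2, col3, col4, col5, col6, col7, col11, col12;
```

Whether this helps depends on the data; EXPLAIN on both versions would confirm whether the distinct aggregation is what forces the extra work.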

Hive properties.

SET mapreduce.reduce.memory.mb=10240;
SET mapreduce.reduce.java.opts=-Xmx9216m;
SET mapreduce.map.memory.mb=10240;
SET mapreduce.map.java.opts=-Xmx9216m;
SET mapreduce.task.io.sort.mb=1536;
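
For the spill churn itself, the knobs below control the map-side sort buffer and merge width (values are illustrative, not recommendations). Note that the logs below report bufvoid = 1073741824, i.e. a 1 GB buffer, even though io.sort.mb is set to 1536 — it is worth verifying the setting actually took effect:

```sql
-- Spill-related knobs (illustrative values; verify against cluster defaults):
SET mapreduce.task.io.sort.mb=1536;        -- map-side sort buffer, in MB
SET mapreduce.map.sort.spill.percent=0.80; -- buffer fill fraction that triggers a spill
SET mapreduce.task.io.sort.factor=100;     -- number of spill files merged per merge pass
```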

Logs:

2019-05-15 05:34:32,600 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 714424619; bufvoid = 1073741824
2019-05-15 05:34:32,600 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452(1073741808); kvend = 232293228(929172912); length = 36142225/67108864
2019-05-15 05:34:32,600 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 750592747 kvi 187648180(750592720)
2019-05-15 05:34:41,305 INFO [main] org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: RS[4]: records written - 10000000
2019-05-15 05:35:01,944 INFO [SpillThread] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
2019-05-15 05:35:07,479 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 0
2019-05-15 05:35:07,480 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 750592747 kv 187648180(750592720) kvi 178606160(714424640)
2019-05-15 05:35:34,178 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[13]: records read - 1000000
2019-05-15 05:35:58,140 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2019-05-15 05:35:58,140 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 750592747; bufend = 390854476; bufvoid = 1073741791
2019-05-15 05:35:58,140 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 187648180(750592720); kvend = 151400696(605602784); length = 36247485/67108864
2019-05-15 05:35:58,141 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 427407372 kvi 106851836(427407344)
2019-05-15 05:36:31,831 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 1
2019-05-15 05:36:31,833 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 427407372 kv 106851836(427407344) kvi 97806648(391226592)
2019-05-15 05:37:19,180 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
  • Increase mapper parallelism: https://stackoverflow.com/a/48487306/2700344 – leftjoin May 15 '19 at 13:54
  • @leftjoin after setting these properties my input split will be recalculated, will this impact the performance too ? mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB set mapreduce.input.fileinputformat.split.minsize=1073741824; -- 1 GB – nilesh1212 May 15 '19 at 14:23
  • Try reducing min and max split size to get more, smaller mappers running. BTW the SQL code can possibly be optimized too; please provide your code – leftjoin May 15 '19 at 14:23
  • It seems to be a bug: the second set is the same property; it should be maxsize. Try reducing them both. Yes, it will affect performance. Check your current settings and try to reduce what you have currently – leftjoin May 15 '19 at 14:26
  • @leftjoin just wanted to add one more point: only the last few mappers end up with a large processing time. For example, if I have 550 mappers, 450 of them finish within 4 hrs and the rest take around 3-4 hrs more to finish. I am trying out the grouping sets query. – nilesh1212 May 15 '19 at 15:02
  • Yep, I see. It can be because some files are too big, and decreasing the split size may help. I do not know what exactly the mappers are doing; without this knowledge it is not possible to suggest a recipe for sure – leftjoin May 15 '19 at 15:06
  • @leftjoin yes, the data is skewed on one of the grouping-set columns, i.e. country contributes about 95% of the data. I am using the properties below to handle the skewed dataset: SET hive.groupby.skewindata=true; SET hive.optimize.skewjoin.compiletime=true; SET hive.optimize.skewjoin=true; – nilesh1212 May 15 '19 at 15:16
  • What exactly does the mapper do? Provide the query and EXPLAIN PLAN please. – leftjoin May 15 '19 at 15:19
  • @leftjoin sample query updated in the post. – nilesh1212 May 15 '19 at 15:37

1 Answer


Check the current values of these parameters and reduce them until you get more mappers running in parallel:

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapreduce.input.fileinputformat.split.minsize=16000;     -- 16 KB
set mapreduce.input.fileinputformat.split.maxsize=128000000; -- 128 MB
-- files bigger than max size will be split;
-- files smaller than min size will be combined and processed by the same mapper

If your files are in a non-splittable format such as gzip, this will not help. Play with these settings to get more, smaller mappers.
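
As a rough rule, each splittable file yields about ceil(file_size / split_size) map tasks, so lowering maxsize increases parallelism. For example (figures illustrative only):

```sql
-- A 1 GB splittable file with maxsize = 128 MB -> ~8 map tasks;
-- with maxsize = 64 MB -> ~16 map tasks.
SET mapreduce.input.fileinputformat.split.maxsize=64000000; -- ~64 MB
```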

These settings may also help to improve the performance of the query:

set hive.optimize.distinct.rewrite=true;
set hive.map.aggr=true;
-- if the files are ORC, check that predicate pushdown (PPD) is enabled:
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
  • Thank you for the settings, let me try them out. I cannot change the input split, as we have a restriction on the number of mappers a job can spin up. – nilesh1212 May 23 '19 at 05:32