
I think my question confused everyone, so let me make it a little clearer. I am trying to order my data. Say my data (a few records) looks like this:

0 1 2 3 4
1 3 8 9 2
2 8 7 9 7

My block size is 128 MB and the file size is 380 MB (3 blocks). I am trying to give an order number to each of my records:

1,0 1 2 3 4
2,1 3 8 9 2
3,2 8 7 9 7

To assign the numbers correctly, I need all the data to go through a single map task; if I get 3 map tasks, my numbering won't be correct.

If I do that, I will get the whole data as it is, right? No changes will happen to the data entering my mapper class; it will be my original data, won't it?
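For context, this is roughly the mapper I have in mind (just a sketch of my idea, assuming a map-only job reading plain text; the class name is mine):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Numbers each record with a counter kept in the mapper instance.
    // This is only correct if the WHOLE file goes through ONE map task.
    public class NumberingMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private long recordNumber = 0;
        private final Text out = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            recordNumber++;
            out.set(recordNumber + "," + value.toString());
            context.write(out, NullWritable.get());
        }
    }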

But even after I set the number of mappers to 1 using

    -D mapreduce.job.maps=1

or

conf.setInt("mapreduce.job.running.map.limit", 1);

my job still produces 3 part-m-000* output files.

I am using the Cloudera distribution, Hadoop 2.6.0-cdh5.4.7.

Am I doing anything wrong? Please advise.

USB

4 Answers

  • Number of mappers

    -Dmapreduce.job.maps=1
    

    This can be used for specifying the default number of mapper tasks per job.

    But, when you submit the job, the JobSubmitter overrides this parameter, based on the number of splits:

    LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
    int maps = writeSplits(job, submitJobDir);
    conf.setInt(MRJobConfig.NUM_MAPS, maps);
    

    In the code above, MRJobConfig.NUM_MAPS is:

    public static final String NUM_MAPS = "mapreduce.job.maps";
    

    and it gets set to the number of splits computed by the writeSplits() method.

    Hence, your setting does not take effect: the split count always wins. (See the sketch after this list for one way to force a single split.)

  • Mapper limit

    conf.setInt("mapreduce.job.running.map.limit", 1);
    

    This setting only controls the maximum number of map tasks running simultaneously; it does not change how many map tasks the job has in total.
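
If you truly need a single map task, a common workaround is to make the input format refuse to split files, so the whole file becomes one split and hence one mapper. A minimal sketch (the class name is illustrative, not a built-in):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // A TextInputFormat that never splits its input, so each input file
    // is processed by exactly one map task.
    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // one split per file, regardless of block count
        }
    }

You would register it in the driver with job.setInputFormatClass(NonSplittableTextInputFormat.class). Note that with a 380 MB file this sacrifices all parallelism, as the other answers point out.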

Manjunath Ballur

If you want to sort your data, it is important that a reduce phase is part of your job. If you want n sorted files, a plain reduce will do; if you want a single output file, you need to set the number of reducers to 1 (similar to what you did for the map side).

Setting the number of mappers to 1 has no impact on what you're trying to achieve other than slowing the job down!
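
For illustration, a sketch of the relevant driver call (standard org.apache.hadoop.mapreduce.Job API; the job name is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SortDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "sort-records");
            // 1 reducer => a single output file, sorted by key (part-r-00000).
            job.setNumReduceTasks(1);
            // For a map-only job instead (no shuffle/sort, part-m-* outputs):
            // job.setNumReduceTasks(0);
            // ... set mapper, reducer, input/output paths, then:
            // System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }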

oae
  • My intention is not sorting the data. I need to give an order number to my data – USB Jan 06 '16 at 04:44
  • Ok, I see. Did you also turn off reduce? I think you need to set the number of reducers to 0: conf.setNumReduceTasks(0) – oae Jan 07 '16 at 09:30
  • Ok, you cannot really set the number of maps this way, because it depends on how many splits your InputFormat creates for the job. If it creates 3 splits, then there are 3 tasks; InputFormats usually take the configured number of mappers only as a hint, with no guarantee. So if you really want to force a map-task count of one, have a look at the InputFormats and their options; there is also something like CombineFileInputFormat. The question, however, is whether using Hadoop for that task is still beneficial, because you are removing all parallelism! – oae Jan 11 '16 at 07:47
  • Yes, you are right. But I wanted to try matrix multiplication. With two plain large matrices we cannot do matrix multiplication, as we cannot guarantee the computation: the data will not be in the same order (as it comes from different splits). So for that I was trying to add row/column dimensions to my data [hint](http://magpiehall.com/two-step-matrix-multiplication-with-hadoop/) – USB Jan 12 '16 at 05:10

Instead of setting the number of mappers to 1, solve the problem in a different way by using secondary sorting.

With a slight manipulation to the format of the key object, secondary sorting gives us the ability to take the value into account during the sort phase.

Have a look at this article for a working code example in Java.

Have a look at this question too: hadoop map reduce secondary sorting.

If you still need only one map task and your parameters are being ignored by the framework, go for a non-splittable Hadoop compression format like gzip (for uncompressed data sizes of less than about 1 GB).

Have a look at this presentation for more details.
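
To give a feel for the composite-key idea (a rough sketch, not the linked article's code; all names here are invented):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // Composite key for secondary sort: partition and group by naturalKey,
    // but order records within a group by secondaryValue.
    public class CompositeKey implements WritableComparable<CompositeKey> {
        private String naturalKey;
        private long secondaryValue;

        public CompositeKey() { }

        public CompositeKey(String naturalKey, long secondaryValue) {
            this.naturalKey = naturalKey;
            this.secondaryValue = secondaryValue;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(naturalKey);
            out.writeLong(secondaryValue);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            naturalKey = in.readUTF();
            secondaryValue = in.readLong();
        }

        // Sort by the natural key first, then by the secondary value.
        @Override
        public int compareTo(CompositeKey other) {
            int cmp = naturalKey.compareTo(other.naturalKey);
            return cmp != 0 ? cmp : Long.compare(secondaryValue, other.secondaryValue);
        }
    }

The driver would then install a partitioner and a grouping comparator that look only at naturalKey (via job.setPartitionerClass(...) and job.setGroupingComparatorClass(...)), so each reduce call sees its values already ordered by secondaryValue.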

Ravindra babu

The description of mapreduce.job.maps here states:

Ignored when mapreduce.jobtracker.address is "local"

So, if you are running on your local machine, that may explain why you get 3 mappers.
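
You can check which mode you are running in by reading that property from your driver (a quick sketch; "local" is the documented default value):

    import org.apache.hadoop.conf.Configuration;

    public class ModeCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // "local" means the local job runner is used
            // and mapreduce.job.maps is ignored.
            System.out.println(conf.get("mapreduce.jobtracker.address", "local"));
        }
    }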

Coming to sorting: the map method, where the application code is written, works on a single input record at a time, so making the sort happen in the map phase gets complicated. On the other hand, it is almost straightforward if you do the sort on the reduce side.

PonMaran