
I have a scenario where I have a list of HDFS locations that will be processed in one MR job; some datasets can be present in multiple locations. For example:

Dataset ids: dataset1, dataset2, dataset3.
HDFLocation1[dataset1,dataset2] (meaning this file holds data for dataset1 and dataset2)
HDFLocation2[dataset1,dataset3]

I have the map below, which gives the HDFS location to process for a given dataset:

[dataset1:HDFLoca1] 
[dataset2:HDFLoca2]
[dataset3:HDFLoca2]

I am thinking of implementing the following logic:

In the map method:

  1. Fetch the dataset id (e.g. dataset1)
  2. Get the current HDFS location
  3. Check against the provided map whether it is the desired location
  4. Skip or process the data based on step 3
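The steps above can be sketched in plain Java, outside of any Hadoop types. `shouldProcess` is a hypothetical helper name, and the map values follow the dataset-to-location map given earlier; in a real Mapper the current location would come from the input split.

```java
import java.util.HashMap;
import java.util.Map;

public class DatasetFilter {
    // Hypothetical helper mirroring steps 1-4: process a record only if the
    // file it came from is the desired location for its dataset id.
    static boolean shouldProcess(String datasetId, String currentLocation,
                                 Map<String, String> desiredLocation) {
        String wanted = desiredLocation.get(datasetId);          // step 3
        return wanted != null && wanted.equals(currentLocation); // step 4
    }

    public static void main(String[] args) {
        // The dataset-to-location map from the question.
        Map<String, String> desired = new HashMap<>();
        desired.put("dataset1", "HDFLoca1");
        desired.put("dataset2", "HDFLoca2");
        desired.put("dataset3", "HDFLoca2");

        // dataset1 exists in both locations, but only HDFLoca1 is desired.
        System.out.println(shouldProcess("dataset1", "HDFLoca1", desired)); // true
        System.out.println(shouldProcess("dataset1", "HDFLoca2", desired)); // false
    }
}
```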

I have seen How to get the input file name in the mapper in a Hadoop program? but this does not work with the Cloudera version I am using (hadoop-core-2.5.1, CDH 5.3.1).
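One common reason the linked approach fails on Hadoop 2.x is that, when `MultipleInputs` is in play, `context.getInputSplit()` returns a package-private `TaggedInputSplit` wrapper rather than a `FileSplit`, so a direct cast throws. The usual workaround is to reflectively call the wrapper's `getInputSplit()` method. Below is a self-contained sketch of that reflection pattern using stand-in classes (the `StandIn` types are mine, not Hadoop's; in a real job the same `unwrap` logic would be applied to the object returned by `context.getInputSplit()`, and the path read via `FileSplit.getPath()`).

```java
import java.lang.reflect.Method;

public class SplitPathDemo {
    // Stand-in for Hadoop's package-private TaggedInputSplit wrapper
    // (assumption based on Hadoop 2.x internals when MultipleInputs is used).
    static class TaggedSplitStandIn {
        private final Object inner;
        TaggedSplitStandIn(Object inner) { this.inner = inner; }
        @SuppressWarnings("unused")
        private Object getInputSplit() { return inner; } // inaccessible, as in Hadoop
    }

    // Stand-in for org.apache.hadoop.mapreduce.lib.input.FileSplit.
    static class FileSplitStandIn {
        String getPath() { return "hdfs://nn/HDFLocation1/part-00000"; }
    }

    // The reflection trick: if the split is a Tagged* wrapper, call its
    // private getInputSplit() to recover the underlying FileSplit.
    static Object unwrap(Object split) throws Exception {
        if (split.getClass().getSimpleName().startsWith("Tagged")) {
            Method m = split.getClass().getDeclaredMethod("getInputSplit");
            m.setAccessible(true);
            return m.invoke(split);
        }
        return split;
    }

    public static void main(String[] args) throws Exception {
        Object split = new TaggedSplitStandIn(new FileSplitStandIn());
        FileSplitStandIn fs = (FileSplitStandIn) unwrap(split);
        System.out.println(fs.getPath());
    }
}
```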

Vikas Singh
  • it does not work => ? What is happening? – Ravindra babu Dec 02 '16 at 16:40
  • Alternative way: Could you instead add dataset id to the each record that you are processing and then group by the dataset id. Further process each group in reducer as needed by your application. – Amit Dec 05 '16 at 18:37
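The alternative suggested in the last comment can be sketched without Hadoop types: tag each record with its dataset id in the map phase, let grouping by that tag play the role of the shuffle, and process each group as a reducer would. `groupByDataset` is a hypothetical helper name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TagAndGroup {
    // Group (datasetId, record) pairs by dataset id -- what the
    // shuffle/sort phase would do between map and reduce.
    static Map<String, List<String>> groupByDataset(List<String[]> tagged) {
        return tagged.stream().collect(Collectors.groupingBy(
                r -> r[0],
                Collectors.mapping(r -> r[1], Collectors.toList())));
    }

    public static void main(String[] args) {
        // Records already tagged with their dataset id (map-phase output).
        List<String[]> tagged = Arrays.asList(
                new String[]{"dataset1", "recA"},
                new String[]{"dataset2", "recB"},
                new String[]{"dataset1", "recC"});

        // "Reduce" phase: each dataset's records arrive together.
        groupByDataset(tagged)
                .forEach((id, recs) -> System.out.println(id + " -> " + recs));
    }
}
```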

0 Answers