I have a scenario where a list of HDFS locations is processed in a single MR job, and some datasets can be present in more than one location. For example:
Dataset IDs: dataset1, dataset2, dataset3.
HDFLocation1[dataset1,dataset2] (meaning this file has data for dataset1 and dataset2)
HDFLocation2[dataset1,dataset3]
I have the map below, which gives the HDFS location that should be processed for each dataset:
[dataset1:HDFLocation1]
[dataset2:HDFLocation1]
[dataset3:HDFLocation2]
I am thinking of implementing the following logic in the map method:
- Fetch the dataset ID (e.g. dataset1).
- Get the current HDFS location.
- Check against the provided map whether this is the desired location for that dataset.
- Skip or process the record based on the result of step 3.
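A minimal sketch of the check in steps 3 and 4, written as plain Java with no Hadoop dependency. The class name `DatasetLocationFilter` and the method `shouldProcess` are hypothetical names used only for illustration, and the location strings are the placeholders from the example above. How the mapper obtains the current location is deliberately left out, since that is the open part of this question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: decides whether a record should be processed,
// given the dataset ID it belongs to and the location it was read from.
final class DatasetLocationFilter {
    // dataset ID -> the single location that should be processed for it
    private final Map<String, String> desiredLocation;

    DatasetLocationFilter(Map<String, String> desiredLocation) {
        this.desiredLocation = new HashMap<>(desiredLocation);
    }

    // Returns true only when the dataset is known and the current
    // location is the one configured for it; otherwise the record is skipped.
    boolean shouldProcess(String datasetId, String currentLocation) {
        String wanted = desiredLocation.get(datasetId);
        return wanted != null && wanted.equals(currentLocation);
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("dataset1", "HDFLocation1");
        map.put("dataset3", "HDFLocation2");
        DatasetLocationFilter filter = new DatasetLocationFilter(map);

        // dataset1 appears in both locations, but only HDFLocation1 is wanted.
        System.out.println(filter.shouldProcess("dataset1", "HDFLocation1")); // true
        System.out.println(filter.shouldProcess("dataset1", "HDFLocation2")); // false
        System.out.println(filter.shouldProcess("dataset3", "HDFLocation2")); // true
    }
}
```

In a real mapper the filter would presumably be built once (for example from the job configuration) and `shouldProcess` called per record, with unknown dataset IDs skipped rather than processed.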
I have seen "How to get the input file name in the mapper in a Hadoop program?", but that approach does not work with the Cloudera version I am using (Hadoop-core-2.5.1, CDH-5.3.1).