I understand from "When do reduce tasks start in Hadoop" that a reduce task in Hadoop consists of three steps: shuffle, sort, and reduce, where the sort (and after it the reduce) can only start once all the mappers are done. Is there a way to start the sort and reduce every time a mapper finishes?
For example, let's say we have only one job, with mappers mapperA and mapperB and 2 reducers. What I want to do is:
- mapperA finishes
- shuffle copies the appropriate partitions of mapperA's output, let's say to reducers 1 and 2
- reducers 1 and 2 start sorting and reducing, and generate some intermediate output
- now mapperB finishes
- shuffle copies the appropriate partitions of mapperB's output to reducers 1 and 2
- sort and reduce on reducers 1 and 2 start again, and each reducer merges the new output with the old one
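To make the steps above concrete, here is a minimal Python sketch of the behavior I am after. It is only a simulation, not Hadoop code: the names `partition`, `shuffle`, `sort_and_reduce`, `merge`, and `run_incremental` are hypothetical, and it assumes the reduce function is a plain sum (associative and commutative), since otherwise merging partial outputs would presumably not give the same answer as reducing everything at once.

```python
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2


def partition(key, num_reducers=NUM_REDUCERS):
    # Hypothetical, deterministic stand-in for a hash partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_output, num_reducers=NUM_REDUCERS):
    # Split one finished mapper's (key, value) pairs by target reducer.
    parts = defaultdict(list)
    for key, value in mapper_output:
        parts[partition(key, num_reducers)].append((key, value))
    return parts


def sort_and_reduce(pairs):
    # Sort one partition by key, then sum the values of each key.
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def merge(old, new):
    # Fold the new partial output into the old one.
    # Assumption: valid only because the reduce is a sum.
    merged = dict(old)
    for key, value in new.items():
        merged[key] = merged.get(key, 0) + value
    return merged


def run_incremental(mapper_outputs, num_reducers=NUM_REDUCERS):
    # One intermediate output per reducer, updated after every mapper.
    state = [{} for _ in range(num_reducers)]
    for output in mapper_outputs:  # one iteration per finished mapper
        for reducer_id, pairs in shuffle(output, num_reducers).items():
            partial = sort_and_reduce(pairs)
            state[reducer_id] = merge(state[reducer_id], partial)
    return state


mapper_a = [("apple", 1), ("pear", 1), ("apple", 1)]
mapper_b = [("pear", 1), ("fig", 1), ("apple", 1)]
result = run_incremental([mapper_a, mapper_b])
print(result)
```

In this sketch each reducer holds a usable intermediate result after mapperA finishes, and the final merged result matches what a single sort and reduce over both mappers' output would give.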
Is this possible? Thanks