I have no practical experience with Hadoop; I have only learnt some theory. The task I am faced with is to process a huge CSV file (far larger than memory) on a cluster, and I have come up with the following procedure.
Suppose the CSV file contains 300 million lines. Call the first 100 million lines part 1, the next 100 million part 2, and the last 100 million part 3. (This is only an example; in practice the data would have to be partitioned into many more parts for each to fit in memory.)
I want to distribute the data to the nodes in the following way.
    Node number    Data taken
    Node 1         part 1 only
    Node 2         part 2 only
    Node 3         part 3 only
    Node 4         parts 1 and 2
    Node 5         parts 2 and 3
    Node 6         parts 1 and 3
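To make this concrete, here is a rough sketch of the kind of streaming mapper I have in mind (untested, and all the names are my own invention). It assumes every input line is prefixed with its global line number, e.g. `42<TAB>col1,col2,...`, because as far as I understand a streaming mapper has no other way to know a line's position in the whole file:

    #!/usr/bin/env python
    # mapper.py - a rough sketch, not tested.  Assumes each input line
    # carries a global line-number prefix ("42<TAB>col1,col2,...").
    import sys

    LINES_PER_PART = 100 * 1000 * 1000

    # Which logical "nodes" (reducer keys) each part must reach,
    # matching the table above.
    PART_TO_NODES = {
        1: (1, 4, 6),
        2: (2, 4, 5),
        3: (3, 5, 6),
    }

    for line in sys.stdin:
        lineno, _, row = line.rstrip("\n").partition("\t")
        part = (int(lineno) - 1) // LINES_PER_PART + 1
        for node in PART_TO_NODES[part]:
            # Key = node number.  The shuffle phase sends every line
            # with the same key to the same reducer; the part number
            # rides along so the reducer can tell its blocks apart.
            print("%d\t%d\t%s" % (node, part, row))

Each line is emitted once per node that needs it, so parts are replicated through the shuffle rather than read selectively by the nodes.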
As the table above shows, some nodes take only one part of the data and some take two. Depending on this, one of two functions is applied on each node. I have learnt this can be done via an if-else statement in the reducer, i.e. my reducer should look something like this:
    if node is 1, 2 or 3:  run f1(data_block)
    if node is 4, 5 or 6:  run f2(data_block_A, data_block_B)
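In actual Hadoop Streaming terms, I picture the reducer roughly like this (again only a sketch; `f1` and `f2` are placeholders for my real functions, and the input format matches the hypothetical mapper above):

    #!/usr/bin/env python
    # reducer.py - sketch only.  Streaming delivers lines sorted by
    # key, so all lines for one node arrive as a contiguous run.
    import sys
    from itertools import groupby

    def f1(block):
        pass            # placeholder: works on a single part

    def f2(block_a, block_b):
        pass            # placeholder: works on a pair of parts

    def records(stream):
        for line in stream:
            node, part, row = line.rstrip("\n").split("\t", 2)
            yield int(node), int(part), row

    for node, group in groupby(records(sys.stdin), key=lambda r: r[0]):
        blocks = {}                    # part number -> list of rows
        for _, part, row in group:
            blocks.setdefault(part, []).append(row)
        if node in (1, 2, 3):          # single-part nodes
            f1(list(blocks.values())[0])
        else:                          # nodes 4, 5, 6 hold two parts
            a, b = (blocks[p] for p in sorted(blocks))
            f2(a, b)

This buffers a whole part in memory on each reducer, which should be fine given my assumption that each individual part fits in memory.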
The problem is that most of the Hadoop examples I have studied do not let a node choose which part of the data it reads; data are distributed to the nodes in a rather black-box way. Is there any way to get around this? P.S. I am planning to rely on Hadoop Streaming, since my primary language is Python rather than Java, so that may be another constraint.
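For reference, this is roughly how I would expect to launch such a job; the streaming jar path varies between Hadoop versions and the HDFS paths here are just guesses for my setup:

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -D mapreduce.job.reduces=6 \
        -input /data/input.csv \
        -output /data/output \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py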