Here is the scenario:
            Reducer1
          /
Mapper ----- Reducer2
          \
            ReducerN
In the reducer I want to write the data to different files. Let's say the reducer looks like this:
import sys

def reduce():
    for line in sys.stdin:
        if line == type1:
            create_type_1_file(line)
        if line == type2:
            create_type_2_file(line)
        if line == type3:
            create_type_3_file(line)
        # ... and so on

def create_type_1_file(line):
    # writes to file1
    ...

def create_type_2_file(line):
    # writes to file2
    ...

def create_type_3_file(line):
    # writes to file3
    ...
Consider the paths to write to as:
file1 = /home/user/data/file1
file2 = /home/user/data/file2
file3 = /home/user/data/file3
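To make that concrete, here is a rough, self-contained version of the reducer writing to those local paths. It assumes each line coming out of the shuffle is tab-separated with the type as the first field; the field layout, the FILES mapping and reduce_stream are just illustrative names, not anything Hadoop provides.

#!/usr/bin/env python
import sys

# Local paths from above, keyed by record type (illustrative layout).
FILES = {
    "type1": "/home/user/data/file1",
    "type2": "/home/user/data/file2",
    "type3": "/home/user/data/file3",
}

def reduce_stream(stream):
    handles = {}
    try:
        for line in stream:
            # Assumes lines look like "<type>\t<payload>".
            record_type, _, payload = line.rstrip("\n").partition("\t")
            path = FILES.get(record_type)
            if path is None:
                continue  # skip unknown types
            if path not in handles:
                handles[path] = open(path, "a")
            handles[path].write(payload + "\n")
    finally:
        for handle in handles.values():
            handle.close()

if __name__ == "__main__":
    reduce_stream(sys.stdin)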
When I run this in pseudo-distributed mode (a single machine with the HDFS daemons running), things are fine, since everything runs on one node and all the reduce tasks write to the same set of files.
Questions:
- If I run this on a cluster of 1000 machines, will they still write to the same set of files? I am writing to the local filesystem in this case.
- Is there a better way to perform this operation in Hadoop Streaming?