4

I am working on a map-reduce job, consisting multiple steps. Using mrjob every step receives previous step output. The problem is I don't want it to.

What I want is to extract some information and use it in second step against all input and so on. Is it possible to do this using mrjob?

Note: Since I don't want to use emr, this question is not much of help to me.

UPDATE: If it would not be possible to do this on a single job, I need to do it in two separate jobs. In this case, is there any way to wrap these two jobs and manage intermediate outpus, etc?

Community
  • 1
  • 1
Mehraban
  • 3,164
  • 4
  • 37
  • 60
  • Not sure if I understand you. But did you consider to use Oozie or Spring? – Radek Tomšej Sep 30 '14 at 11:13
  • The question seems a little abstract. Can you make your points more clear by showing us some code, and what exactly are your trying to do? – Abhishek Pathak Sep 30 '14 at 12:18
  • @RadekTomšej I tried sth like sonic's answer. What are the benefits of using Oozie or Spring over that approach? Could you provide an answer with some examples? – Mehraban Oct 03 '14 at 14:09
  • @AbhishekPathak Take a look at sonic's answer. That's what I needed to do. But I'd rather do my tasks in a **single** jab. – Mehraban Oct 03 '14 at 14:55
  • @SAM : you can use oozie workflow to define map-reduce actions, first action 'ok to' should be set as second action, also the input for the second action should contain the output of first action and the original input. But here also its going to be two jobs.. invoked through oozie – vishnu viswanath Oct 05 '14 at 08:57
  • and you can have a look at this : https://github.com/klbostee/dumbo – vishnu viswanath Oct 05 '14 at 09:09

1 Answers1

2

You can use Runners

You will have to define the jobs separately and use another python script to invoke it.

from NumLines import NumLines
from WordsPerLine import WordsPerLine
import sys

intermediate = None

def firstJob(input_file):
    global intermediate
    mr_job = NumLines(args=[input_file])
    with mr_job.make_runner() as runner:
        runner.run()
        intermediate = runner.get_output_dir()

def secondJob(input_file):
    mr_job = WordsPerLine(args=[intermediate,input_file])
    with mr_job.make_runner() as runner:
        runner.run()

if __name__ == '__main__':
    firstJob(sys.argv[1]) 
    secondJob(sys.argv[1])

and can be invoked by:

python main_script.py input.txt
vishnu viswanath
  • 3,794
  • 2
  • 36
  • 47