I have an mrjob that consists of 3 steps. The second step expects as input the results of the first step plus some more content from S3.

I understand that I can always "stream" the extra content through the first step, meaning emit it as is and only use it in the second step, but I would like to avoid this.

Is there a way to define additional input to later steps in mrjob?

Eleni

1 Answer


Instead of grouping the steps into a single job, you might consider using a persistent job flow to separate your task into the parts before and after the secondary input:

Re-use Amazon Elastic MapReduce instance

http://pythonhosted.org/mrjob/guides/emr-advanced.html
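As a rough sketch of what that split could look like, the snippet below only builds the command-line arguments for two separate mrjob runs on the same cluster. The script names (first_job.py, second_job.py), the S3 paths, and the cluster ID are all made up for illustration. It assumes the job reads every input path given on the command line, and that your mrjob version uses --cluster-id (older releases call it --emr-job-flow-id).

```python
# Sketch only: builds argument lists for two separate mrjob runs that
# share one persistent EMR cluster. Script names, S3 paths, and the
# cluster ID are hypothetical placeholders.

def first_job_args(input_path, output_dir, cluster_id):
    """Arguments for the job that holds the original first step."""
    return [
        "first_job.py",
        "-r", "emr",
        "--cluster-id", cluster_id,  # older mrjob releases: --emr-job-flow-id
        "--output-dir", output_dir,  # keep the results on S3 for the next job
        input_path,
    ]


def second_job_args(first_output_dir, extra_input, output_dir, cluster_id):
    """Arguments for the job that holds the remaining steps.

    It receives two input paths: the first job's output directory and
    the additional S3 content.
    """
    return [
        "second_job.py",
        "-r", "emr",
        "--cluster-id", cluster_id,
        "--output-dir", output_dir,
        first_output_dir,
        extra_input,
    ]


if __name__ == "__main__":
    cluster = "j-EXAMPLE"  # hypothetical ID of the persistent cluster
    step1 = first_job_args(
        "s3://my-bucket/raw/", "s3://my-bucket/step1-out/", cluster)
    step2 = second_job_args(
        "s3://my-bucket/step1-out/", "s3://my-bucket/extra/",
        "s3://my-bucket/final/", cluster)
    print("python " + " ".join(step1))
    print("python " + " ".join(step2))
```

Note that the second job's first mapper would see records from both sources, so it has to be able to tell them apart (for example by their format), since the first job's output is written with that job's output protocol while the extra content is presumably raw lines.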

Taro Sato