I have a MapReduce job defined in `main.py`, which imports the `lib` module from `lib.py`. I use Hadoop Streaming to submit this job to the Hadoop cluster as follows:
```
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files lib.py,main.py \
    -mapper "./main.py map" -reducer "./main.py reduce" \
    -input input -output output
```
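For context, `main.py` dispatches on its first argument roughly like the sketch below. This is a simplified outline, not the real file: the actual mapper/reducer logic is omitted, and `lib.process` is just a placeholder for whatever `lib.py` provides.

```python
#!/usr/bin/env python
# Sketch of main.py's structure only -- real mapper/reducer bodies omitted.
import sys

import lib   # <-- this is the import that fails on the cluster


def run_map(stream):
    for line in stream:
        # lib.process stands in for whatever lib.py actually exposes
        key, value = lib.process(line)
        print('%s\t%s' % (key, value))


def run_reduce(stream):
    for line in stream:
        key, value = line.rstrip('\n').split('\t', 1)
        print('%s\t%s' % (key, value))   # identity reduce, just for illustration


if __name__ == '__main__':
    # Hadoop Streaming invokes "./main.py map" and "./main.py reduce"
    if sys.argv[1] == 'map':
        run_map(sys.stdin)
    else:
        run_reduce(sys.stdin)
```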
In my understanding, this should put both `main.py` and `lib.py` into the distributed cache folder on each compute node and thus make the `lib` module available to `main`. But that doesn't happen: the logs show that the files really are copied to the same directory, yet `main` can't import `lib` and raises `ImportError`.

Why does this happen, and how can I fix it?
UPD. Adding the current directory to the path didn't work:

```python
import os
import sys

sys.path.append(os.path.realpath(__file__))
import lib
# ImportError
```
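(To be precise about what I mean by "the current directory": the snippet above appends the file path itself, whereas the directory-based variant of the same idea would look like the sketch below. I'm showing it only to clarify the intent, assuming `main.py` and `lib.py` end up side by side in the task's working directory.)

```python
import os
import sys

# Append the directory containing this script, not the script's own path
sys.path.append(os.path.dirname(os.path.realpath(__file__)))
import lib
```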
though loading the module manually did the trick:

```python
import imp

lib = imp.load_source('lib', 'lib.py')
```
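(For what it's worth, the same manual load can be written with `importlib` on Python 3.5+, since `imp` is deprecated there; just a sketch of the equivalent calls:)

```python
import importlib.util

# Equivalent of imp.load_source('lib', 'lib.py')
spec = importlib.util.spec_from_file_location('lib', 'lib.py')
lib = importlib.util.module_from_spec(spec)
spec.loader.exec_module(lib)
```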
But that's not what I want. So why can the Python interpreter see the other `.py` files in the same directory, yet not import them? Note that I have already tried adding an empty `__init__.py` file to the same directory, without effect.