
I have a MapReduce job defined in main.py, which imports the lib module from lib.py. I use Hadoop Streaming to submit this job to the Hadoop cluster as follows:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files lib.py,main.py \
    -mapper "./main.py map" -reducer "./main.py reduce" \
    -input input -output output
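
For illustration, main.py is shaped roughly like this (a stripped-down sketch; the real map/reduce logic doesn't matter here, and lib.do_map / lib.do_reduce are just placeholders):

#!/usr/bin/env python
import sys
import lib  # this is the import that fails on the cluster

def main():
    # dispatch on the "map"/"reduce" argument passed via -mapper/-reducer
    if sys.argv[1] == 'map':
        for line in sys.stdin:
            sys.stdout.write(lib.do_map(line))
    else:
        for line in sys.stdin:
            sys.stdout.write(lib.do_reduce(line))

if __name__ == '__main__':
    main()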

In my understanding, this should put both main.py and lib.py into the distributed cache folder on each compute node and thus make the lib module available to main. But that doesn't happen: the log shows that the files really are copied to the same directory, yet main can't import lib and throws an ImportError.

Why does this happen and how can I fix it?

UPD. Adding the script's directory to the path didn't work:

import os
import sys
sys.path.append(os.path.dirname(os.path.realpath(__file__)))
import lib
# ImportError

though, loading the module manually did the trick:

import imp
lib = imp.load_source('lib', 'lib.py')

But that's not what I want. So why does the Python interpreter see other .py files in the same directory yet fail to import them? Note that I have already tried adding an empty __init__.py file to that directory, without effect.

ffriend
  • Have you checked `sys.path` in `main.py` to make sure the working directory is included? – lmjohns3 Aug 09 '13 at 15:26
  • @lmjohns3: yes, the working directory is on `sys.path`. BTW, isn't it automatically included for a running script? (just curious) – ffriend Aug 09 '13 at 15:32
  • I believe that's true for Python scripts started on the command line, but Hadoop Streaming might be starting the Python interpreter in another way (not really sure). Either way, I still think this sounds like a path issue. See http://www.litfuel.net/plush/?postid=195 for one possible way to distribute your modules differently. Alternatively, try writing your commands into a shell script and passing that as the `-mapper` and `-reducer` command-line arguments. – lmjohns3 Aug 09 '13 at 15:35
  • @lmjohns3: yeah, I have seen that trick, but it puts some restrictions on `main`, while I'm trying to keep importing as simple as possible. The point is to create a distributable library that you can just `import`. – ffriend Aug 09 '13 at 15:59

3 Answers


I posted the question to the Hadoop user list and finally found the answer. It turns out that Hadoop doesn't really copy the files to the location where the command runs, but instead creates symlinks to them there. Python, in turn, can't work with these symlinks and thus doesn't recognize lib.py as a Python module.
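
You can see this for yourself with a small diagnostic dropped into the mapper (a sketch; it writes to stderr so the output ends up in the task logs rather than in the job output):

import os
import sys

sys.stderr.write('cwd: %s\n' % os.getcwd())
sys.stderr.write('script: %s -> %s\n' % (__file__, os.path.realpath(__file__)))
for name in os.listdir('.'):
    sys.stderr.write('%s (symlink: %s)\n' % (name, os.path.islink(name)))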

A simple workaround is to put both main.py and lib.py into the same directory and ship that directory, so that a symlink to the directory is placed in the MR job's working directory while the two files physically sit next to each other. So I did the following:

  1. Put main.py and lib.py into the app directory.
  2. In main.py I used lib.py directly, that is, the import statement is just

    import lib

  3. Uploaded the app directory with the -files option.

So, the final command looks like this:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files app \
       -mapper "app/main.py map" -reducer "app/main.py reduce" \
       -input input -output output
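
For completeness, the layout on the submitting machine is simply:

app/
    main.py    # starts with "import lib"
    lib.py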
ffriend

When Hadoop Streaming starts the Python scripts, your script's path resolves to where the script file really is. However, Hadoop runs the scripts from './', and your lib.py (a symlink) is in './' too. So add './' to sys.path before you import lib, like this:

import sys
sys.path.append('./')
import lib

Muyoo
  • I'm using YARN, which seems not to support -files the way the selected answer uses it. This worked great, thanks! – bkribbs Apr 12 '17 at 18:56

The -files and -archives switches are just shortcuts to Hadoop's distributed cache (DC), a more general mechanism that also allows you to upload and automatically unpack archives in the zip, tar and tgz/tar.gz formats. If your library is implemented as a structured Python package rather than a single module, the latter feature is what you want.
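
For example, with plain Hadoop Streaming you could ship the package as an archive and point sys.path at the unpacked directory. This is only a sketch with hypothetical names: it assumes mypkg.tgz contains a top-level mypkg/ package, is passed via -archives, and is unpacked by the distributed cache into a directory in the task's working directory named after the archive (or after the link name given with '#'):

# hypothetical: the job was submitted with "-archives mypkg.tgz"
import sys
sys.path.insert(0, 'mypkg.tgz')  # the unpacked archive shows up as this directory
import mypkg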

We have supported this directly in Pydoop since release 1.0.0-rc1, where you can simply build a mypkg.tgz archive and run your program as:

pydoop submit --upload-archive-to-cache mypkg.tgz [...]

The relevant docs are at http://crs4.github.io/pydoop/self_contained.html and here is a full working example (requires wheel): https://github.com/crs4/pydoop/tree/master/examples/self_contained.

simleo