Redux: How do I get Jython to use Python modules stored in Lib within its own jar file when running in Hadoop?

Question

I'm attempting to use Jython for an implementation within Hadoop 1.2.1. I have seen strikingly little about Jython+Hadoop other than stale projects (like code.google.com/p/happy), and a stale implementation in $HADOOP_HOME/src/examples/python/WordCount.py, so perhaps I'm barking up the wrong tree to begin with... but this seems reasonable and possible. I am also very aware of Hadoop Streaming, with which I can use Python in Hadoop without using Jython, but that's not what I'm trying to do here.

Basically, when I invoke the embedded/standalone Jython jar file using java -jar /full/path/to/myjythonjar.jar, the /full/path/to/myjythonjar.jar/Lib is in my Python sys.path, but when I invoke using bin/hadoop jar /full/path/to/myjythonjar.jar input output the ...jar/Lib is not in my path, and the script can't find the Python modules I'm referencing.

Here's what I'm doing...

I'm using the standalone version of the Jython jar, and using the JarRunner interface, roughly as described on SO here and other places; essentially as follows:

 cp jython-standalone-2.7-b1.jar jythonsalib_test.jar  
 jar ufe jythonsalib_test.jar org.python.util.JarRunner __run__.py

That is, take a copy of the standalone jar, add my script with name __run__.py, and change the Manifest to execute JarRunner -- many thanks to @Frank Wierzbicki for that gem.
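Before involving Hadoop, it can help to confirm that the rebuilt jar really contains both the script and the bundled modules. The following is only a sketch: `check_jar` is a hypothetical helper name, and it assumes the script sits at the jar root as `__run__.py` and the modules sit under `Lib/`, as described above.

```python
import zipfile


def check_jar(jar_path):
    """Return (has_run, has_lib) for the jar at jar_path.

    has_run: a __run__.py entry exists at the jar root.
    has_lib: at least one entry lives under Lib/.
    """
    # A jar is a zip archive, so the zipfile module can list its entries.
    with zipfile.ZipFile(jar_path) as jar:
        names = jar.namelist()
    has_run = "__run__.py" in names
    has_lib = any(name.startswith("Lib/") for name in names)
    return has_run, has_lib
```

Calling `check_jar("jythonsalib_test.jar")` from the directory holding the jar should then return a pair of booleans that shows whether either piece is missing.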

This all works fine when I'm running directly as, e.g.,

java -jar jythonsalib_test.jar

My sys.path reports that it includes '/full/path/to/jar/file/jythonsalib_test.jar/Lib', which is exactly what I expect, and it is the path from which I'm getting the Python modules (empirically tested by setting sys.path to null-list (fails) and ONLY that path (works)).

When I run this same jar in Hadoop, e.g., as

bin/hadoop jar /full/path/to/jar/file/jythonsalib_test.jar input output

sys.path only includes

['__classpath__', '__pyclasspath__']
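To compare the two launch modes side by side, a small diagnostic at the top of `__run__.py` can report what the interpreter sees. This is a minimal sketch, not part of the original setup; `describe_path` is a hypothetical name, and it treats any entry ending in `/Lib` as a candidate module directory.

```python
import sys


def describe_path(path=None):
    """Summarize sys.prefix and the sys.path entries that end in /Lib."""
    path = sys.path if path is None else path
    # Entries such as '/full/path/to/app.jar/Lib' are the ones that matter here.
    lib_entries = [p for p in path if p.rstrip("/").endswith("/Lib")]
    return {"prefix": getattr(sys, "prefix", None), "lib_entries": lib_entries}


# Print the summary for the current interpreter.
print(describe_path())
```

Under `java -jar` the `lib_entries` list should contain the jar's Lib path; under `bin/hadoop jar`, given the sys.path shown above, it would be empty.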

I've also used the Jython standalone jar versions 2.5.4-rc1 (which has the same behavior described above) and 2.5.3 (that doesn't work for me for unrelated reasons).

As pointed out in other SO answers, the workaround I'm currently using is to add my jar's Lib directory to sys.path directly, inside the Jython script, like this:

import sys
sys.path.append('/full/path/to/jar/file/jythonsalib_test.jar/Lib')

And this basically works -- but this is meant to be a distributed application! There is no path that I can reference in this way. Other SO articles suggest various mechanisms, but are all basically adding to library paths (again, no links because I have <10rep) by Python like above, Java, or Jython installation or Jython "registry" (startup/rc) files. Sure, I could use HDFS or bootstrapping mechanisms or other mechanisms to distribute something to the compute nodes, like the jar or Jython or whatever, but the code is already in the jar! So I shouldn't need to distribute it again, separately...

So, in sum: It looks like I need to be on a filesystem that can directly, and separately, reference the jar file containing Python modules (akin to the old java -jar jythonjar.jar -jar jythonjar.jar). How do I convince an embedded, standalone Jython jar to always use the Python modules within the Lib subdirectory of the jar file, without separately pointing to (potentially the same) jar file?

Or: how do I add a relative path link to the current jar file...? Or am I missing something more insidious and fundamental about Hadoop or Jython or Java or...?

I had a boatload more links, but SO tells me that I can only have TWO links because I'm new here. I hope some day to get enough rep to be able to truly contribute to this fantastic site! :)

Anyway. LTWFTW -- long time watcher, first time writer -- many thanks!

Your question is too long. It's challenging for anyone to go through so much to answer. I suggest breaking it down and/or asking what is "most" relevant. — Siddharth, Apr 29 '14 at 04:44
@Siddharth Would you rather have it lack detail? I think this is a really nice descriptive question and I appreciate the effort that hoc_age put in to this (especially for a first question!). — bjb568, Apr 29 '14 at 05:10
So do I. I see @hoc_age as so mature for our community and appreciate the time he has put in. That said, I wanted to warn him about the fact that people generally don't read such long and information-heavy questions. He would attract better answers if he thought about breaking it down a bit. Won't you agree? — Siddharth, Apr 29 '14 at 14:45
Indeed -- I agree with both of you, @{Siddharth,bjb568} but can't tag both of you. I tried to front-load important info, and end with **bolded** question for minimally-painful parsing. As I mention in my [_other_ first post](http://meta.stackoverflow.com/questions/252149/how-does-a-new-user-get-started-on-stack-overflow), I've been a lurker for years and appreciate brevity. Thanks for the gentle reminder. Turns out my question is flawed for other reasons. I'm still listening, if anyone has thoughts; but another, shorter, question is forthcoming... — hoc_age, Apr 29 '14 at 19:50

score 1 · Answer 1 · answered Jun 03 '14 at 03:12

I wonder if packaging your app with OneJar would improve things. Please try and report back. I'm just shooting in the dark here.

answered Jun 03 '14 at 03:12

aissacf


score 0 · Answer 2 · edited May 23 '17 at 12:29

Hadoop (version 2.6.0-cdh5.4.2 running MR1 jobs) + Jython (version 2.7.0) only has this problem in the launching phase: that is, when the main or Tool code is running, Jython's sys.prefix is null and sys.path doesn't contain the /path/to/jarfile.jar/Lib entry that you need, resulting in the error message. In the remote mapper code, the sys.path is correctly set.

One option is to only use Jython in the remote mappers and reducers.

If you need to run Jython in the launching phase, you can edit the sys.path manually (before the first call to PythonInterpreter).

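// Needs Py, PyString and PySystemState from org.python.core.
// Location this class was loaded from (under Hadoop, the exploded jar directory).
String pathToJar = getClass().getProtectionDomain().getCodeSource().getLocation().getPath();
PySystemState sys = Py.getSystemState();
sys.path.insert(0, new PyString(pathToJar));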

See this SO question (or elsewhere) for the pathToJar trick. If you first look at pathToJar, you might think it's not going to work because when you run it in Hadoop, you actually get the path to the exploded jar in a temporary directory, rather than the original jar file. That's okay: this exploded directory has a Lib directory and Jython picks up the exploded one, rather than the jarred one.
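If the launcher passes that location on to the script (how it is passed is left open here), the same adjustment can be made on the Python side. The following is a sketch under stated assumptions: `lib_entry` and `ensure_lib_on_path` are hypothetical names, and it assumes the modules live in the Lib subdirectory of the location, as in the question, so it adds `location/Lib` rather than the location itself as the Java snippet does.

```python
import sys


def lib_entry(location):
    """Return the sys.path entry for the Lib folder under location.

    location may be a jar path or an exploded-jar directory; the question's
    sys.path shows entries of the form '/path/to/app.jar/Lib'.
    """
    return location.rstrip("/") + "/Lib"


def ensure_lib_on_path(location, path=None):
    """Put location's Lib entry at the front of path unless already present."""
    path = sys.path if path is None else path
    entry = lib_entry(location)
    if entry not in path:
        path.insert(0, entry)
    return entry
```

Because the function checks for the entry first, calling it in both the launcher and the mappers does not pile up duplicate entries.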

Finally, I'm also assuming that your original job jar is a jar-with-dependencies that depends on jython-standalone and excludes hadoop-core, as is normally the case for Hadoop job jars.
