
I am trying to call two of my own modules from Pig.

Here's module_one.py:

import sys
print sys.path  # debug: show Jython's module search path at import time

def foo():
    pass

Here's module_two.py:

from module_one import foo

def bar():
    foo()

I uploaded both of them to S3.

Here's what I get when trying to import them into Pig:

 
2015-06-14 12:12:10,578 [main] INFO  org.apache.pig.Main - Apache Pig version 0.12.0-amzn-2 (rexported) compiled May 05 2015, 19:03:23
2015-06-14 12:12:10,579 [main] INFO  org.apache.pig.Main - Logging error messages to: /mnt/var/log/apps/pig.log
2015-06-14 12:12:10,620 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2015-06-14 12:12:11,277 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-06-14 12:12:11,279 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-06-14 12:12:11,279 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://1.1.1.1:9000
2015-06-14 12:12:12,794 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

grunt> REGISTER 's3://mybucket/pig/module_one.py' USING jython AS m1;
2015-06-14 12:12:15,177 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-06-14 12:12:17,457 [main] INFO  com.amazon.ws.emr.hadoop.fs.EmrFileSystem - Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
2015-06-14 12:12:17,889 [main] INFO  amazon.emr.metrics.MetricsSaver - MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false maxMemoryMb: 3072 maxInstanceCount: 500
2015-06-14 12:12:17,889 [main] INFO  amazon.emr.metrics.MetricsSaver - Created MetricsSaver j-5G45FR7N987G:i-a95a5379:RunJar:03073 period:60 /mnt/var/em/raw/i-a95a5379_20150614_RunJar_03073_raw.bin
2015-06-14 12:12:18,633 [main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - Opening 's3://mybucket/pig/module_one.py' for reading
2015-06-14 12:12:18,661 [main] INFO  amazon.emr.metrics.MetricsSaver - Thread 1 created MetricsLockFreeSaver 1
2015-06-14 12:12:18,743 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - created tmp python.cachedir=/tmp/pig_jython_4599752347759040376
2015-06-14 12:12:21,060 [main] WARN  org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing.
['/home/hadoop/.versions/pig-0.12.0-amzn-2/lib/Lib', '/home/hadoop/.versions/pig-0.12.0-amzn-2/lib/jython-standalone-2.5.3.jar/Lib', 'classpath', 'pyclasspath/', '/home/hadoop']
2015-06-14 12:12:21,142 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: m1.foo

grunt> REGISTER 's3://mybucket/pig/module_two.py' USING jython AS m2;
2015-06-14 12:12:33,870 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-06-14 12:12:33,918 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-06-14 12:12:34,020 [main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - Opening 's3://mybucket/pig/module_two.py' for reading
2015-06-14 12:12:34,064 [main] WARN  org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing.
2015-06-14 12:12:34,621 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1121: Python Error. Traceback (most recent call last):
  File "/tmp/pig1436120267849453375tmp/module_two.py", line 1, in <module>
    from module_one import foo
ImportError: No module named module_one
Details at logfile: /mnt/var/log/apps/pig.log

I tried:

  • The usual sys.path.append('./Lib') and sys.path.append('.'); neither helped

  • Hacking the folder location in with sys.path.append(os.path.dirname(__file__)), which failed with NameError: name '__file__' is not defined

  • Creating an __init__.py and loading it with REGISTER

  • sys.path.append('s3://mybucket/pig/'), which didn't work either

I'm using Apache Pig version 0.12.0-amzn-2, since that's apparently the only version that can currently be selected.

MaratC
  • Hello time travellers from the future! For the record, I pushed all of my python stuff into one big file and I am using it by `REGISTER 's3://mybucket/pig/one_big_pile_of_stuff.py' USING jython AS myfuncs;`. – MaratC Jun 18 '15 at 13:27

2 Answers


You registered the first Python UDF file as m1, so you should access its namespace with m1.foo(), not through module_one.
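
For illustration, once the first file is registered under the m1 namespace, the function is invoked through that namespace from Pig Latin. A minimal sketch (the input path, relation names and schema are hypothetical; foo as defined in the question takes no arguments):

grunt> REGISTER 's3://mybucket/pig/module_one.py' USING jython AS m1;
grunt> data = LOAD 's3://mybucket/input.txt' AS (line:chararray);
grunt> out = FOREACH data GENERATE m1.foo();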

Edit: the second Python file should be:

from m1 import foo

def bar():
    foo()

I just tested it on Amazon EMR and it works.
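
For reference, the registration sequence is the same one shown in the question's log; only the import line inside module_two.py changes:

grunt> REGISTER 's3://mybucket/pig/module_one.py' USING jython AS m1;
grunt> REGISTER 's3://mybucket/pig/module_two.py' USING jython AS m2;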

glefait
  • I did `REGISTER 's3://mybucket/pig/module_one.py' USING jython AS module_one;` and I still get `ImportError: No module named module_one` on trying to register the second one. – MaratC Jun 15 '15 at 15:15
  • I updated my answer with the content of the second script. GRUNT seems happy with that ;) – glefait Jun 15 '15 at 15:59
  • Strange. I get `ImportError: No module named m1` . – MaratC Jun 16 '15 at 14:15

Based on what I've found in "How do I get the path and name of the file that is currently executing?" (another Stack Overflow question), I managed to register the path that contains the custom module I want to load in my Pig UDF:

import inspect, os, sys

# resolve this script's directory from the call stack, since __file__
# is not defined when Pig runs the script through Jython
sys.path.append(os.path.dirname(os.path.abspath(inspect.stack()[0][1])))
import myModule

So if your module_two.py runs from the same folder as the module_one.py it imports, this should make the import work under Pig.
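
Applied to the files from the question, module_two.py would look something like this (a sketch; it assumes Pig materialises both .py files in the same directory on the machine executing the script):

import inspect, os, sys

# add the directory of the currently executing script to the module
# search path so that sibling modules copied alongside it can be imported
sys.path.append(os.path.dirname(os.path.abspath(inspect.stack()[0][1])))

from module_one import foo

def bar():
    foo()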

Oliver