
Working with pig:

tmp$ pig --version
Apache Pig version 0.8.1-cdh3u1 (rexported) compiled Jul 18 2011, 08:29:40

I have a Python script (CPython) that imports another script; both are very simple in my example:

DATA

example$ hadoop fs -cat /user/pavel/trivial.log

1   one
2   two
3   three

EXAMPLE WITHOUT INCLUDE - works fine

example$ pig -f trivial_stream.pig

(1,1,one)
()
(1,2,two)
()
(1,3,three)
()

where 1) trivial_stream.pig:

DEFINE test_stream `test_stream.py` SHIP ('test_stream.py');
A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray);
C = STREAM A THROUGH test_stream;
DUMP C;

2) test_stream.py

#! /usr/bin/env python

import sys
import string

for line in sys.stdin:
    if len(line) == 0: continue
    new_line = line
    print "%d\t%s" % (1, new_line) 

So essentially I just aggregate lines with one key, nothing special.

EXAMPLE WITH INCLUDE - bombs!

Now I'd like to append a string from a Python module that sits in the same directory as test_stream.py. I've tried shipping the imported module in many different ways, but I always get the same error (see below).

1) trivial_stream.pig:

DEFINE test_stream `test_stream.py` SHIP ('test_stream.py', 'test_import.py');
A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray);
C = STREAM A THROUGH test_stream;
DUMP C;

2) test_stream.py

#! /usr/bin/env python

import sys
import string

import test_import

for line in sys.stdin:
    if len(line) == 0: continue
    new_line = ("%s-%s") % (line.strip(), test_import.getTestLine())
    print "%d\t%s" % (1, new_line) 

3) test_import.py

def getTestLine():
    return "test line"

Now

example$ pig -f trivial_stream.pig

Backend error message

org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:265)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.cleanup(PigMapBase.java:103)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)

Pig Stack Trace

ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C. Backend error : Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
    at org.apache.pig.PigServer.openIterator(PigServer.java:753)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:615)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
    at org.apache.pig.Main.run(Main.java:396)
    at org.apache.pig.Main.main(Main.java:107)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:337)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:382)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1209)
    at org.apache.pig.PigServer.storeEx(PigServer.java:885)
    at org.apache.pig.PigServer.store(PigServer.java:827)
    at org.apache.pig.PigServer.openIterator(PigServer.java:739)
    ... 7 more

Thank you very much for your help! -Pavel

Tadeck
PVM
  • Have you tried just piping some text into it at the shell, to see if you get a Python exception? – Thomas K Nov 22 '11 at 21:01
  • Yes, that was the first thing I did; it works fine. – PVM Nov 22 '11 at 21:19
  • Figured it out (RTM). The dependencies aren't shipped. If you want your Python app to work with Pig, you need to tar it (don't forget the __init__.py files!), then include the .tar file in Pig's SHIP statement. The first thing the script does is untar the app. There might be issues with paths, so I'd suggest the following even before tar extraction: sys.path.insert(0, os.getcwd()) – PVM Dec 03 '11 at 00:01
  • Please add this as an answer! – Mike Sukmanowsky Aug 19 '12 at 11:24
  • For people who found this post when looking for [ERROR 1066: Unable to open iterator for alias](http://stackoverflow.com/questions/34495085/error-1066-unable-to-open-iterator-for-alias-in-pig-generic-solution) here is a [generic solution](http://stackoverflow.com/a/34495086/983722). – Dennis Jaheruddin Dec 28 '15 at 14:33

2 Answers


Correct answer from comment above:

The dependencies aren't shipped. If you want your Python app to work with Pig, you need to tar it (don't forget the __init__.py files!), then include the .tar file in Pig's SHIP statement. The first thing the script does is untar the app. There might be issues with paths, so I'd suggest the following even before tar extraction: sys.path.insert(0, os.getcwd()).
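
A minimal sketch of the steps described above, assuming the app was archived as app.tar and that name was listed in the SHIP clause. Both the archive name and the helper name bootstrap_shipped_archive are hypothetical and only serve as illustration; the code uses nothing beyond the standard os, sys and tarfile modules.

```python
import os
import sys
import tarfile


def bootstrap_shipped_archive(archive_name="app.tar", workdir=None):
    """Make modules from a shipped tar archive importable.

    archive_name is a hypothetical file name; use whatever was listed in SHIP.
    Returns True if an archive was found and unpacked, False otherwise.
    """
    workdir = workdir or os.getcwd()

    # Put the working directory on the import path before extracting,
    # as suggested in the answer.
    if workdir not in sys.path:
        sys.path.insert(0, workdir)

    archive_path = os.path.join(workdir, archive_name)
    if not os.path.exists(archive_path):
        return False

    # Unpack the archive next to the streaming script.
    with tarfile.open(archive_path) as archive:
        archive.extractall(workdir)
    return True
```

In the streaming script, the call would go at the very top, before any import of the shipped modules, for example bootstrap_shipped_archive() followed by import test_import.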

rjurney

You need to append the current directory to sys.path in your test_stream.py:

#! /usr/bin/env python

import sys
sys.path.append(".")

The SHIP command you had there does ship the Python script; you just need to tell Python where to look.
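
As a rough illustration of this idea, the sketch below puts the path fix and the import behind a small helper and keeps the line formatting in a separate function so it can be tried locally. The helper names import_shipped and tag_lines are hypothetical, and the sketch assumes the shipped module ends up in the task's working directory. It stores the absolute path of that directory rather than the literal ".", which is a small variation on the snippet above.

```python
import importlib
import os
import sys


def import_shipped(module_name, search_dir="."):
    """Import module_name after making sure search_dir is on sys.path.

    Assumption: SHIP placed the module file in search_dir (by default the
    current working directory of the streaming task).
    """
    search_dir = os.path.abspath(search_dir)
    if search_dir not in sys.path:
        sys.path.append(search_dir)
    return importlib.import_module(module_name)


def tag_lines(lines, suffix, key=1):
    """Yield 'key<TAB>line-suffix' for every non-empty input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield "%d\t%s-%s" % (key, line, suffix)
```

Inside test_stream.py this could be used as test_import = import_shipped("test_import"), followed by a loop that prints each record from tag_lines(sys.stdin, test_import.getTestLine()).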

png