I am using Pig with streaming_python UDFs, and I am wondering whether there is a way to ship my own Python files along with the registration of my streaming_python UDFs.
When I use Jython or Java, it is simple: I can either put all the dependencies into the .jar or use something like:
REGISTER /home/hadoop/jython_libs/jyson-1.0.2.jar;
This approach unfortunately does not work for .py or .zip files.
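For illustration, this is roughly the kind of registration I tried for my own files (the paths and file names below are just placeholders); it does not get them onto the workers:
REGISTER '/home/hadoop/python_libs/my_helpers.py';
REGISTER '/home/hadoop/python_libs/my_package.zip';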
I have also found out that with standard streaming I can use the SHIP clause of DEFINE (http://pig.apache.org/docs/r0.14.0/basic.html#define-udfs), but in that case I could not use streaming_python UDFs, which already implement the serialisation and deserialisation from/to Pig.
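For reference, a minimal sketch of that standard-streaming alternative, assuming a hypothetical helper file my_helpers.py and an output schema I made up (SHIP does copy the listed files to the task nodes, but the data then crosses the stream as plain tab-delimited text, so I lose the typed (de)serialisation that streaming_python gives me):
-- standard streaming with SHIP, applied to the ARTICLES relation loaded below
DEFINE lda_stream `python lda_scripts.py` SHIP('/home/hadoop/lda_scripts.py', '/home/hadoop/my_helpers.py');
TOPICS_STREAMED = STREAM ARTICLES THROUGH lda_stream AS (topics:chararray);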
I am currently using Pig as follows:
-- load articles
ARTICLES = LOAD 's3://my_articles/articles/*' USING TextLoader as (json);
-- register udfs
REGISTER '/home/hadoop/lda_scripts.py' USING streaming_python AS lda_udfs;
-- transform
TOPICS = foreach ARTICLES generate lda_udfs.transform_json_sparse(json);
-- execute pipeline and print
dump TOPICS;
I am pretty much following: https://help.mortardata.com/technologies/pig/writing_python_udfs
I have also gained some information from: How do I make Hadoop find imported Python modules when using Python UDFs in Pig? However, I can't install all the packages in a bootstrap script via pip; I need to ship some files.
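For context, lda_scripts.py is structured roughly like this (the helper module name and the function body are placeholders); it is the import of my own local module that I need to get onto the workers:
# lda_scripts.py -- sketch only
from pig_util import outputSchema  # provided by Pig's streaming_python support

import my_lda_helpers  # placeholder for the local module/package I need to ship

@outputSchema('topics:chararray')
def transform_json_sparse(json_str):
    # the real work lives in my own module, which is why it must be on the workers
    return my_lda_helpers.transform(json_str)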
Does anyone have experience with shipping custom Python packages and files to the workers? Is there a simple way to do it?