I am using Pig with streaming_python UDFs, and I am wondering whether there is a way to ship my own Python files along with the registration of my streaming_python UDFs.
When I use Jython or Java, it is simple: I can either put all the dependencies into the .jar or use something like:
REGISTER /home/hadoop/jython_libs/jyson-1.0.2.jar;
This approach unfortunately does not work for .py or .zip files.
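For illustration, this is roughly the kind of registration I tried for my own files (the paths and file names below are just placeholders); it does not get them onto the workers:
REGISTER '/home/hadoop/python_libs/my_helpers.py';
REGISTER '/home/hadoop/python_libs/my_package.zip';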
I have also found out that with standard streaming I can use the SHIP clause of DEFINE (http://pig.apache.org/docs/r0.14.0/basic.html#define-udfs), but in that case I could not use streaming_python UDFs, which already implement the serialisation and deserialisation from/to Pig.
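For reference, a minimal sketch of that standard-streaming alternative, assuming a hypothetical helper file my_helpers.py and an output schema I made up (SHIP does copy the listed files to the task nodes, but the data then crosses the stream as plain tab-delimited text, so I lose the typed (de)serialisation that streaming_python gives me):
-- standard streaming with SHIP, applied to the ARTICLES relation loaded below
DEFINE lda_stream `python lda_scripts.py` SHIP('/home/hadoop/lda_scripts.py', '/home/hadoop/my_helpers.py');
TOPICS_STREAMED = STREAM ARTICLES THROUGH lda_stream AS (topics:chararray);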
I am currently using Pig as follows:
-- load articles
ARTICLES = LOAD 's3://my_articles/articles/*' USING TextLoader as (json);
-- register udfs
REGISTER '/home/hadoop/lda_scripts.py' USING streaming_python AS lda_udfs;
-- transform
TOPICS = foreach ARTICLES generate lda_udfs.transform_json_sparse(json);
-- execute pipeline and print
dump TOPICS;
I am pretty much following: https://help.mortardata.com/technologies/pig/writing_python_udfs
I have also gained some information from: How do I make Hadoop find imported Python modules when using Python UDFs in Pig? However, I can't install all the packages in a bootstrap script via pip; I need to ship some files.
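For context, lda_scripts.py is structured roughly like this (the helper module name and the function body are placeholders); it is the import of my own local module that I need to get onto the workers:
# lda_scripts.py -- sketch only
from pig_util import outputSchema  # provided by Pig's streaming_python support

import my_lda_helpers  # placeholder for the local module/package I need to ship

@outputSchema('topics:chararray')
def transform_json_sparse(json_str):
    # the real work lives in my own module, which is why it must be on the workers
    return my_lda_helpers.transform(json_str)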
Does anyone have experience with shipping custom Python packages and files to the workers? Is there a simple way to do it?