
I am using Pig with streaming_python UDFs, and I am wondering whether there is a way to ship my own files (modules, data files, etc.) along with the registration of my streaming_python UDFs.

When I use Jython or Java, it is simple: I can either put all the dependencies into the .jar, or use something like:

REGISTER /home/hadoop/jython_libs/jyson-1.0.2.jar;

Unfortunately, this approach does not work for .py or .zip files.

I have also found out that with standard streaming I can use SHIP in the DEFINE command (http://pig.apache.org/docs/r0.14.0/basic.html#define-udfs), but in that case I would not be able to use streaming_python UDFs, which already implement the serialisation and deserialisation from/to Pig for me.
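
For reference, the standard-streaming alternative I mean looks roughly like this (just a sketch; the script name lda_stream.py and the output schema are placeholders, not my actual code):

-- SHIP sends the local file to the workers, but the streamed script then
-- has to handle its own (de)serialisation instead of using streaming_python
DEFINE my_stream `python lda_stream.py` SHIP('/home/hadoop/lda_stream.py');
TOPICS = STREAM ARTICLES THROUGH my_stream AS (topics:chararray);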

I am currently using Pig as follows:

-- load articles
ARTICLES = LOAD 's3://my_articles/articles/*' USING TextLoader as (json);

-- register udfs
REGISTER '/home/hadoop/lda_scripts.py' USING streaming_python AS lda_udfs;

-- transform
TOPICS = foreach ARTICLES generate lda_udfs.transform_json_sparse(json);

-- execute pipeline and print
dump TOPICS;
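
For completeness, the registered file looks roughly like this (a minimal sketch, assuming a chararray output schema and simple JSON parsing; only the function name transform_json_sparse comes from my actual script, and the commented-out import stands for the kind of local dependency I would like to ship):

# lda_scripts.py (simplified sketch, not the full file)
from pig_util import outputSchema
import json

# import my_lda_model  # local module that would need to be shipped to the workers

@outputSchema('topics:chararray')
def transform_json_sparse(article_json):
    # parse one article's JSON and return its topic representation
    doc = json.loads(article_json)
    # ... here the shipped LDA code would be used ...
    return str(doc)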

I am pretty much following: https://help.mortardata.com/technologies/pig/writing_python_udfs

I have also gained some information from: "How do I make Hadoop find imported Python modules when using Python UDFs in Pig?" However, I can't install all the packages via pip in a bootstrap script; I need to ship some files.

Does anyone have experience with shipping custom Python packages and files to the workers? Is there a simple way to do it?
