I have a Python UDF that uses lxml. My Pig job that uses the UDF fails:
File "PigParse.py", line 10, in ParseToPig
ImportError: No module named lxml
The Python script works fine as a standalone program. Line 10 is:
from lxml import etree
Do I need to distribute lxml to the Hadoop cluster somehow? If so, how, and which version should I use?
I have seen examples of distributing nltk using Hadoop's -file option, but nothing for Pig.
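In case it helps narrow this down, here is a minimal sketch of a check that could be run both standalone and from inside the UDF to compare the two environments. describe_environment is a hypothetical helper name, not part of Pig or lxml. It uses only the standard library and reports which interpreter is running and whether the module can be imported there. A sys.platform value starting with "java" would suggest the UDF is running under Jython rather than CPython.

```python
import sys


def describe_environment(module_name="lxml"):
    """Report the running interpreter and whether module_name can be imported.

    Hypothetical diagnostic helper; the name and the returned keys are
    illustrative only.
    """
    info = {
        "platform": sys.platform,      # starts with "java" under Jython
        "version": sys.version,        # full interpreter version string
        "executable": sys.executable,  # path to the interpreter, may be None
        "module": module_name,
    }
    try:
        __import__(module_name)
        info["importable"] = True
    except ImportError:
        info["importable"] = False
    return info


if __name__ == "__main__":
    # Print one key per line so the output is easy to find in task logs.
    for key, value in sorted(describe_environment().items()):
        print("%s: %s" % (key, value))
```

If the standalone run and the run inside the Pig job report different platforms or versions, that would point to an interpreter mismatch rather than a missing file on the cluster.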
TIA!!!