2

I am new to PIG and Hadoop. I have written a PIG UDF which operates on String and returns a string. I actually use a class from an already existing jar which contains the business logic in the udf. The class constructor takes 2 filenames as input which it uses for building some dictionary used for processing the input. How to get it working in mapreduce mode I tried passing the filenames in pig local mode it works fine. But I dont know how to make it work in mapreduce mode? Can distributed cache solve the problem?

Here is my code

REGISTER tokenParser.jar

REGISTER sampleudf.jar;


DEFINE TOKENPARSER com.yahoo.sample.ParseToken('conf/input1.txt','conf/input2.xml');

A = LOAD './inputHOP.txt' USING PigStorage() AS (tok:chararray);
B = FOREACH A GENERATE TOKENPARSER(tok);
STORE B into 'newTokout' USING PigStorage();

From what I understand is tokenParser.jar must be using some sort of BufferedInputReader. Is it possible to make it work without changing tokenParser.jar

Tom Praison
  • 41
  • 1
  • 5

1 Answers1

1

Yes, like in this similar question using the distributed cache is a good way to solve this problem.

Community
  • 1
  • 1
Romain
  • 7,022
  • 3
  • 30
  • 30