
When running a Java MapReduce job via Hadoop, you can specify the -archives option to select archive files that are uploaded with the job and automatically unarchived, so that code can access their contents.
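
For context, the MapReduce invocation in question looks roughly like this when the job's main class uses ToolRunner (the jar name, class, and paths here are placeholders, not from the original post):

    hadoop jar myjob.jar com.example.MyJob \
        -archives hdfs://host:port/path/mydata.zip#mydata \
        input output

The optional #mydata fragment sets the symlink name under which the unarchived contents appear in each task's working directory.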

Is there an equivalent in Pig? I'm writing a UDF that uses a library whose source code I don't have access to. The library requires a path to a directory from which it loads some files.

How can I ship such a directory with a Pig job?

humanzz

2 Answers


The answer to this turns out to be simple and is already mentioned in https://stackoverflow.com/a/4966099.

The correct way to do it is:

  1. Put the file you want available locally to each task into HDFS.
  2. Run Pig, letting it know that it should use that file from HDFS, as follows:

    pig ... -Dmapred.cache.archives=hdfs://host:port/path/GeoIP.dat.zip#GeoIP.dat -Dmapred.create.symlink=yes ...
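
With the symlink in place, the UDF can hand the library a plain relative path. A minimal sketch, assuming a hypothetical UDF class GeoLookup with a stand-in lookup method (neither comes from the original post):

    import java.io.File;
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class GeoLookup extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return null;
            }
            // mapred.create.symlink=yes makes the unarchived zip appear as
            // a symlink named GeoIP.dat in the task's working directory,
            // so a relative path is all the library needs.
            File dataDir = new File("GeoIP.dat");
            return lookup(dataDir, (String) input.get(0));
        }

        // Stand-in for the closed-source library call from the question.
        private String lookup(File dir, String key) {
            return dir.getPath() + ":" + key;
        }
    }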

humanzz

Take a look at Pig's SHIP clause.
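
SHIP is part of Pig's DEFINE statement: it copies local files to every task node, though unlike the distributed-cache approach above it ships individual files rather than unarchiving a directory. A sketch with placeholder paths:

    -- Ship a local script to the task nodes and stream records through it.
    DEFINE my_cmd `process.sh` SHIP('/local/path/process.sh');
    raw = LOAD 'input' AS (line:chararray);
    out = STREAM raw THROUGH my_cmd;
    STORE out INTO 'output';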

Donald Miner