
When running a Java MapReduce job via Hadoop, you can specify the -archives option to select archive files that are uploaded with the job and automatically unarchived, so that code can access their contents.
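
For context, the MapReduce invocation in question looks roughly like this when the job's main class uses ToolRunner (the jar name, class, and paths here are placeholders, not from the original post):

    hadoop jar myjob.jar com.example.MyJob \
        -archives hdfs://host:port/path/mydata.zip#mydata \
        input output

The optional #mydata fragment sets the symlink name under which the unarchived contents appear in each task's working directory.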

Is there an equivalent in Pig? I'm writing a UDF that uses a library whose source code I don't have access to. The library requires a path to a directory from which it loads some files.

How can I ship such a directory with a Pig job?

humanzz

2 Answers


The answer to this turns out to be simple and is already mentioned in https://stackoverflow.com/a/4966099.

The correct way to do it is:

  1. Put the file you want available locally to each task into HDFS.
  2. Run Pig, letting it know that it should use that file from HDFS, as follows:

    pig ... -Dmapred.cache.archives=hdfs://host:port/path/GeoIP.dat.zip#GeoIP.dat -Dmapred.create.symlink=yes ...
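
With the symlink in place, the UDF can hand the library a plain relative path. A minimal sketch, assuming a hypothetical UDF class GeoLookup with a stand-in lookup method (neither comes from the original post):

    import java.io.File;
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class GeoLookup extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return null;
            }
            // mapred.create.symlink=yes makes the unarchived zip appear as
            // a symlink named GeoIP.dat in the task's working directory,
            // so a relative path is all the library needs.
            File dataDir = new File("GeoIP.dat");
            return lookup(dataDir, (String) input.get(0));
        }

        // Stand-in for the closed-source library call from the question.
        private String lookup(File dir, String key) {
            return dir.getPath() + ":" + key;
        }
    }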

humanzz

Take a look at Pig's SHIP clause.
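
SHIP is part of Pig's DEFINE statement: it copies local files to every task node, though unlike the distributed-cache approach above it ships individual files rather than unarchiving a directory. A sketch with placeholder paths:

    -- Ship a local script to the task nodes and stream records through it.
    DEFINE my_cmd `process.sh` SHIP('/local/path/process.sh');
    raw = LOAD 'input' AS (line:chararray);
    out = STREAM raw THROUGH my_cmd;
    STORE out INTO 'output';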

Donald Miner