4

The context of this question is that I am trying to use the MaxMind Java API in a Pig script that I have written... I do not think that knowing about either is necessary to answer the question, however.

The MaxMind API has a constructor which requires a path to a file called GeoIP.dat, a comma-delimited file containing the information it needs.

I have a jar file which contains the API, as well as a wrapping class which instantiates LookupService and uses it. My idea is to package the GeoIP.dat file into the jar, and then access it as a resource in the jar file. The issue is that I do not know how to construct a path that the constructor can use.

Looking at the API, this is how they load the file:

public LookupService(String databaseFile) throws IOException {
    this(new File(databaseFile));
}


public LookupService(File databaseFile) throws IOException {
    this.databaseFile = databaseFile;
    this.file = new RandomAccessFile(databaseFile, "r");
    init();
}

I only paste that because I am not averse to editing the API itself to make this work, if necessary, but I do not know how I would replicate that functionality otherwise. Ideally I'd like to get the resource into File form, though, or else editing the API will be quite a chore.

Is this possible?

A Question Asker

6 Answers

2

This works for me.

Assuming you have a package org.foo.bar.util that contains GeoLiteCity.dat:

URL fileURL = this.getClass().getResource("/org/foo/bar/util/GeoLiteCity.dat");
File geoIPData = new File(fileURL.toURI());
LookupService cl = new LookupService(geoIPData, LookupService.GEOIP_MEMORY_CACHE);

2

Try:

new File(MyWrappingClass.class.getResource(<resource>).toURI())
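
For example, with the asker's file (a minimal sketch, assuming GeoIP.dat is packaged at the root of the jar):

File f = new File(MyWrappingClass.class.getResource("/GeoIP.dat").toURI());
LookupService lookup = new LookupService(f);

Note that this only works while the resource is an actual file on disk, e.g. when running from an exploded classpath or src/test/resources. Once the resource is packed inside a jar, getResource returns a jar: URL whose URI is not hierarchical, so new File(uri) throws IllegalArgumentException; that is exactly what the comments below run into.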
Puce
  • If the jar is in the effective classpath, it should work, since you can have URIs representing resources over the network, on your filesystem, or within your classpath. I've done something similar to this suggestion. However, it's been a while and I could be missing a specific detail or two; still, this (or something close to it) should work for jar files in one's classpath. – luis.espinal Feb 10 '11 at 16:34
  • I used it at least in a unit test to access a resource in src/test/resources – Puce Feb 10 '11 at 16:35
  • I tried this and it did not work. I tried this specifically: File f = new File(getClass().getResource("/GeoIp.dat").toURI()); and it failed. I will try your specific syntax, but I think mine should be fine? I did a toString and it takes the form: jar:file:/home/aquestion/udfs/maxmind/jar/maxmind.jar!/GeoIp.dat. Perhaps that has to be parsed in a certain way to be readable? – A Question Asker Feb 10 '11 at 17:05
2

Dump your data to a temp file, and feed the temp file to it.

File tmpFile = File.createTempFile("geoip", ".dat"); // prefix must be at least 3 chars
tmpFile.deleteOnExit();

InputStream is = MyClass.class.getResourceAsStream("/path/in/jar/XX.dat");
OutputStream os = new FileOutputStream(tmpFile);

// read from is, write to os, then close both
byte[] buf = new byte[8192];
int len;
while ((len = is.read(buf)) != -1) {
    os.write(buf, 0, len);
}
is.close();
os.close();

Then hand tmpFile to the LookupService(File) constructor.
irreputable
2

One recommended way is to use the Distributed Cache rather than trying to bundle it into a jar.

Zip GeoIP.dat and copy it to hdfs://host:port/path/GeoIP.dat.zip. Then add these options to the Pig command:

pig ...
  -Dmapred.cache.archives=hdfs://host:port/path/GeoIP.dat.zip#GeoIP.dat 
  -Dmapred.create.symlink=yes
...

And LookupService lookupService = new LookupService("./GeoIP.dat"); should then work in your UDF, as the file will be present locally to the tasks on each node.
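
For illustration, a minimal UDF sketch under those assumptions (IpToCountry is a hypothetical class name; getCountry/getName come from the legacy LookupService API, and the lookup is created lazily so it runs on the task node where the symlink exists):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

import com.maxmind.geoip.LookupService;

// Hypothetical Pig UDF: resolves an IP to a country name using the
// GeoIP.dat symlink that the distributed cache creates in the task's
// working directory.
public class IpToCountry extends EvalFunc<String> {
    private LookupService lookupService;

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        if (lookupService == null) {
            lookupService = new LookupService("./GeoIP.dat");
        }
        return lookupService.getCountry((String) input.get(0)).getName();
    }
}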

Romain
  • Since Pig 0.9.0, the `EvalFunc` interface has a method `getCacheFiles`, in which a list of HDFS paths can be given. The corresponding files are then accessible through the distributed cache, for example with `FileReader fr = new FileReader("./some.file");`. See [PIG-1752](https://issues.apache.org/jira/browse/PIG-1752) – maxjakob Jun 25 '12 at 08:31
1

Use the classloader.getResource(...) method to look the file up on the classpath; this will pull it from the JAR file.

This means you will have to alter the existing code to override the loading. The details on how to do that depend heavily on your existing code and environment. In some cases subclassing and registering the subclass with the framework might work. In other cases, you might have to determine the ordering of class loading along the classpath and place an identically signed class "earlier" in the classpath.
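
As one illustration of the subclassing route, a minimal sketch (ClasspathLookupService is hypothetical, and it assumes LookupService is not final): rather than overriding the file loading itself, it stages the classpath resource to a temp file and delegates to the public LookupService(File) constructor quoted in the question.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import com.maxmind.geoip.LookupService;

// Hypothetical subclass: resolves the database from the classpath by
// copying it to a temp file, then reuses LookupService(File) unchanged.
public class ClasspathLookupService extends LookupService {

    public ClasspathLookupService(String resourcePath) throws IOException {
        super(stage(resourcePath));
    }

    // Copy the classpath resource (e.g. "/GeoIP.dat") to a temp file.
    private static File stage(String resourcePath) throws IOException {
        File tmp = File.createTempFile("geoip", ".dat");
        tmp.deleteOnExit();
        InputStream in = ClasspathLookupService.class.getResourceAsStream(resourcePath);
        OutputStream out = new FileOutputStream(tmp);
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        in.close();
        out.close();
        return tmp;
    }
}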

Edwin Buck
1

Here's how we use the MaxMind GeoIP:

We put the GeoIPCity.dat file into the cloud and use the cloud location as an argument when we launch the process. The code where we get the GeoIPCity.dat file and create a new LookupService is:

// Typically run once per task, e.g. from the mapper's setup() method.
if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
    // Utility.arrayToList is a local helper that converts the Path[] to a List.
    List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
    for (Path localFile : localFiles) {
        // Pick the GeoIP database out of the files shipped with -files.
        if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
            m_geoipLookupService = new LookupService(new File(localFile.toUri().getPath()));
        }
    }
}

Here is an abbreviated version of the command we use to run our process:

$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat -libjars /usr/lib/COMPANY/analytics/libjars/geoiplookup.jar

The critical components of this for running the MaxMind component are the -files and -libjars arguments. These are generic options in the GenericOptionsParser.

-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.

I'm assuming that Hadoop uses the GenericOptionsParser because I can find no reference to it anywhere in my project. :)
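
That assumption is typically right: GenericOptionsParser runs implicitly when the job's entry point goes through ToolRunner, which is why it never shows up in application code. A minimal driver sketch (MyJob is a placeholder name):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver: ToolRunner invokes GenericOptionsParser, which
// strips -files/-libjars before run(...) sees the remaining args.
public class MyJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "geoip lookup");
        // ... configure mapper, input/output paths, etc. ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJob(), args));
    }
}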

If you put the GeoIPCity.dat on the cloud and specify it using the -files argument, it will be put into the local cache, which the mapper can then get in the setup function (sketched below). It doesn't have to be in setup; it only needs to be done once per mapper, so that is an excellent place for it. Then use the -libjars argument to specify the geoiplookup.jar (or whatever you've called yours) and it will be able to use it. We don't put the geoiplookup.jar on the cloud. I'm rolling with the assumption that Hadoop will distribute the jar as it needs to.
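
A compact sketch of that setup hook (GeoIpMapper is a hypothetical name; it simply wraps the cache-scanning snippet shown above):

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.maxmind.geoip.LookupService;

// Hypothetical mapper: initializes the LookupService once per task from
// the file that -files shipped into the distributed cache.
public class GeoIpMapper extends Mapper<LongWritable, Text, Text, Text> {
    private LookupService geoipLookupService;

    @Override
    protected void setup(Context context) throws IOException {
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached == null) {
            return;
        }
        for (Path p : cached) {
            if ("GeoIPCity.dat".equalsIgnoreCase(p.getName())) {
                geoipLookupService = new LookupService(new File(p.toUri().getPath()));
            }
        }
    }
}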

I hope that all makes sense. I'm getting fairly familiar with Hadoop/MapReduce, but I didn't write the pieces that use the MaxMind GeoIP component in the project, so I've had to do a little digging to understand it well enough to give the explanation I have here.

EDIT: Additional description of -files and -libjars:

-files The files argument is used to distribute files through the Hadoop Distributed Cache. In the example above, we are distributing the MaxMind geo-IP data file through the Hadoop Distributed Cache. We need access to the geo-IP data file to map the user's IP address to the appropriate country, region, city, and timezone. The API requires that the data file be present locally, which is not feasible in a distributed processing environment (we are not guaranteed which nodes in the cluster will process the data). To distribute the appropriate data to the processing node, we use the Hadoop Distributed Cache infrastructure. The GenericOptionsParser and the ToolRunner automatically facilitate this via the -files argument. Note that the file we distribute should be available in the cloud (HDFS).

-libjars The -libjars argument is used to distribute any additional dependencies required by the map-reduce jobs. Like the data file, we also need to copy the dependent libraries to the nodes in the cluster where the job will be run. The GenericOptionsParser and the ToolRunner automatically facilitate this via the -libjars argument.

QuinnG