Here's how we use the MaxMind GeoIP:
We put the GeoIPCity.dat
file into the cloud and use the cloud location as an argument when we launch the process.
The code where we get the GeoIPCity.dat
file and create a new LookupService
is:
if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
    List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
    for (Path localFile : localFiles) {
        if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
            m_geoipLookupService = new LookupService(new File(localFile.toUri().getPath()));
        }
    }
}
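For context, here is a sketch of how that snippet might sit inside a mapper's setup method. The class name and key/value types are illustrative assumptions, not the actual project code; the lookup logic itself is the same as above (iterating the Path[] directly instead of through the Utility.arrayToList helper):

```java
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.maxmind.geoip.LookupService;

public class GeoLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private LookupService m_geoipLookupService;

    @Override
    protected void setup(Context context) throws IOException {
        // Files shipped with -files land in each task's local cache.
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cacheFiles != null) {
            for (Path localFile : cacheFiles) {
                if ("GeoIPCity.dat".equalsIgnoreCase(localFile.getName())) {
                    // The file is now on local disk, so the MaxMind API can open it.
                    m_geoipLookupService = new LookupService(new File(localFile.toUri().getPath()));
                }
            }
        }
    }
}
```

This is a sketch only; it needs a Hadoop cluster and the MaxMind jar on the classpath to actually run.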
Here is an abbreviated version of the command we use to run our process:
$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat -libjars /usr/lib/COMPANY/analytics/libjars/geoiplookup.jar
The critical components of this for running the MaxMind component are the -files
and -libjars
arguments. These are generic options handled by the GenericOptionsParser.
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
I'm assuming that Hadoop uses the GenericOptionsParser
because I can find no reference to it anywhere in my project. :)
If you put the GeoIPCity.dat
on the cloud and specify it using the -files
argument, it will be put into the local cache, which the mapper can then read in the setup
function. It doesn't have to be in setup
, but it only needs to be done once per mapper, so that's an excellent place to put it.
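Once the LookupService exists, the map function can use it directly. A hedged sketch, assuming the legacy MaxMind Java API (where getLocation returns a Location with public fields such as countryName and city) and assuming one IP address per input line; names here are illustrative:

```java
// Inside the same mapper class whose setup() created m_geoipLookupService.
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String ip = value.toString().trim(); // assumption: each input line is one IP address
    com.maxmind.geoip.Location location = m_geoipLookupService.getLocation(ip);
    if (location != null) {
        // Emit the IP with whatever fields you need, e.g. country and city.
        context.write(new Text(ip), new Text(location.countryName + "\t" + location.city));
    }
}
```

Again a sketch, not runnable outside a Hadoop job with the MaxMind jar available.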
Then use the -libjars
argument to specify the geoiplookup.jar (or whatever you've called yours) and it will be able to use it. We don't put the geoiplookup.jar on the cloud. I'm rolling with the assumption that Hadoop will distribute the jar as it needs to.
I hope that all makes sense. I am getting fairly familiar with hadoop/mapreduce, but I didn't write the pieces that use the maxmind geoip component in the project, so I've had to do a little digging to understand it well enough to do the explanation I have here.
EDIT: Additional description for the -files
and -libjars
-files The files argument is used to distribute files through the Hadoop Distributed Cache. In the example above, we are distributing the Max Mind geo-ip data file through the Hadoop Distributed Cache. We need access to the Max Mind geo-ip data file to map the user's ip address to the appropriate country, region, city, and timezone. The API requires that the data file be present locally, which is not feasible in a distributed processing environment (we will not be guaranteed which nodes in the cluster will process the data). To distribute the appropriate data to the processing node, we use the Hadoop Distributed Cache infrastructure. The GenericOptionsParser and the ToolRunner automatically facilitate this using the -files argument. Please note that the file we distribute should be available in the cloud (HDFS).
-libjars The -libjars argument is used to distribute any additional dependencies required by the map-reduce jobs. Like the data file, we also need to copy the dependent libraries to the nodes in the cluster where the job will be run. The GenericOptionsParser and the ToolRunner automatically facilitate this using the -libjars argument.
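For -files and -libjars to be parsed at all, the job's main class has to go through ToolRunner, which wires in the GenericOptionsParser. A minimal driver sketch, with the job configuration details elided and all class names as placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class GeoLookupDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // By the time run() is called, GenericOptionsParser has already
        // consumed -files and -libjars and stripped them from args.
        Job job = new Job(getConf(), "geoip lookup");
        job.setJarByClass(GeoLookupDriver.class);
        // job.setMapperClass(...) -- set to the project's mapper class here
        // ... input/output paths and formats set here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new GeoLookupDriver(), args));
    }
}
```

If the main class constructs a Job directly from raw args without ToolRunner or GenericOptionsParser, the -files and -libjars options are silently treated as ordinary arguments and nothing gets distributed.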