I want to perform GeoIP lookups on my data in Spark. To do that I'm using MaxMind's GeoIP database.
What I want to do is initialize a GeoIP database object once on each partition, and later use it to look up the city associated with an IP address.
Does Spark have an initialization phase for each node, or should I instead check whether an instance variable is undefined and, if so, initialize it before continuing? E.g. something like this (it's Python, but I want a Scala solution):
class IPLookup(object):
    database = None

    def getCity(self, ip):
        # Initialise the GeoIP database on first use
        if not self.database:
            self.database = self.initialise(geoipPath)
        ...
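In Scala I imagine the direct translation would look roughly like the sketch below. This is only a sketch of the pattern, assuming MaxMind's GeoIP2 Java DatabaseReader as the underlying reader; geoipPath would be supplied from the driver.

import java.io.File
import java.net.InetAddress

import com.maxmind.geoip2.DatabaseReader

class IPLookup(geoipPath: String) extends Serializable {
  // Built on first use, so each executor JVM opens the database itself
  // rather than receiving it over the wire.
  private var database: DatabaseReader = _

  def getCity(ip: String): String = {
    if (database == null) {
      database = new DatabaseReader.Builder(new File(geoipPath)).build()
    }
    database.city(InetAddress.getByName(ip)).getCity.getName
  }
}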
Of course, doing this requires that Spark serialise the whole object, something the docs caution against.
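For reference, the call site would be roughly this (a simplified sketch; ips stands for an RDD[String] of addresses), and it's the captured lookup instance that Spark then has to ship to the executors:

val lookup = new IPLookup(geoipPath)            // constructed on the driver
val cities = ips.map(ip => lookup.getCity(ip))  // the closure captures `lookup`, so Spark serialises it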