I want to build a per-country traffic report from an nginx access.log file. This is my code snippet, using Apache Spark on Python:
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonAccessLogAnalyzer")

    def get_country_from_line(line):
        try:
            from geoip import geolite2
            ip = line.split(' ')[0]
            match = geolite2.lookup(ip)
            if match is not None:
                return match.country
            else:
                return "Unknown"
        except IndexError:
            return "Error"

    rdd = sc.textFile("/Users/victor/access.log").map(get_country_from_line)
    ips = rdd.countByValue()
    print(ips)
    sc.stop()
On a 6GB log file it took an hour to complete (I ran it on my MacBook Pro, 4 cores), which is too slow. I think the bottleneck is that every time Spark maps a line it has to import geolite2, which I believe loads some database. Is there any way to import geolite2 once on each worker instead of once per line? Would that boost performance? Any suggestions to improve this code?
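If I understand `mapPartitions` correctly, I could restructure the function so the import runs once per partition rather than once per line. This is an untested sketch of what I mean (same lookup logic as above, just moved into a generator that processes a whole partition):

```python
def get_countries_for_partition(lines):
    # mapPartitions hands this function an iterator over an entire
    # partition, so the import (and whatever database geolite2 loads
    # behind it) happens once per task instead of once per line.
    from geoip import geolite2
    for line in lines:
        try:
            match = geolite2.lookup(line.split(' ')[0])
            yield match.country if match is not None else "Unknown"
        except IndexError:
            yield "Error"
```

I would then replace `.map(get_country_from_line)` with `.mapPartitions(get_countries_for_partition)` and keep the `countByValue()` call as-is. Is that the right approach?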