I want to build a per-country traffic report from an nginx access.log file. This is my code, using Apache Spark with Python:

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonAccessLogAnalyzer")
    def get_country_from_line(line):
        try:
            from geoip import geolite2
            ip = line.split(' ')[0]
            match = geolite2.lookup(ip)
            if match is not None:
                return match.country
            else:
                return "Unknown"
        except IndexError:
            return "Error"

    rdd = sc.textFile("/Users/victor/access.log").map(get_country_from_line)
    ips = rdd.countByValue()

    print ips
    sc.stop()

On a 6 GB log file, the job took an hour to complete (on my MacBook Pro, 4 cores), which is too slow. I think the bottleneck is that whenever Spark maps a line it has to import geolite2, which I believe has to load a database. Is there any way to import geolite2 once per worker instead of once per line? Would that improve performance? Any other suggestions for improving this code?

vutran

1 Answer

What about using broadcast variables? The Spark documentation explains how they work. In short, they are read-only variables that are shipped to each worker node once and then read whenever needed.
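As a rough illustration of the idea, here is a minimal sketch of broadcasting a plain dictionary and reading it inside a map. The names `country_of` and `count_by_country` and the exact-match dictionary `ip_to_country` are hypothetical; `sc` is assumed to be an already created SparkContext, and a real lookup would need IP ranges rather than exact addresses.

```python
def country_of(line, ip_to_country):
    """Return the country for the client IP in the first field of a log line."""
    ip = line.split(' ')[0]
    # Hypothetical exact-match table; unknown IPs fall back to "Unknown".
    return ip_to_country.get(ip, "Unknown")


def count_by_country(sc, path, ip_to_country):
    """Count log lines per country. `sc` is an existing SparkContext."""
    # The dictionary is shipped to each worker once, not with every task.
    lookup = sc.broadcast(ip_to_country)
    return (sc.textFile(path)
              .map(lambda line: country_of(line, lookup.value))
              .countByValue())
```

This only works if the broadcast value can be pickled, so a plain dict or list is a safer candidate than an object that holds open files or threads.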

mgaido
  • I already tried broadcasting it but it didn't work. I think the geolite2 contains too many references to complex objects. – vutran Mar 10 '15 at 10:32
  • Please have a look at my gist: https://gist.github.com/tranvictor/fc5e7338abd4fdaffedc – vutran Mar 10 '15 at 11:39
  • Ok, now it is clear: it manages threads internally, so references to it cannot be broadcast. What you actually need is a map from IP address to geolocation. I think the simplest solution is to find such a map and, if you can, use it as a broadcast variable. – mgaido Mar 10 '15 at 14:31