1

So I'm trying to import about 80 million page views from a log file and put them in the database as sessions, i.e. groups of page views split wherever more than 20 minutes pass between consecutive views.

So eventually, in my user database I would like each user to have a list of dictionary objects like so:

{
    'id': 'user1',
    'sessions': [
        {
            "start" : ISODate("2011-04-03T23:21:59.639Z"),
            "end" : ISODate("2011-04-03T23:50:05.518Z"),
            "page_loads" : 136
        },
        {
            "start" : ISODate("another date"),
            "end" : ISODate("later date"),
            "page_loads" : 20
        }
    ]
}

Should be fairly simple. So I wrote this script:

from collections import defaultdict
from datetime import datetime

howManyLinesTotal = 9999999 #I've run: wc -l in bash beforehand to find the line count

def blank_session(): #each user needs their own fresh dict, not a shared one
    return {'page_loads':0, 'start':0, 'end':0}

latest_sessions = defaultdict(blank_session)

count = 0
for line in f: #f is the gigantic gzip log file, opened earlier with gzip.open()
    line = line.split('\t') #each entry is tab-delimited with: user \t datetime \t page_on_my_site

    user = line[1] #grab the data from this line in the file
    timestamp = datetime.utcfromtimestamp(float(line[2]))

    latest_sessions[user]['page_loads'] += 1 #add one to this user's current session

    if latest_sessions[user]['start'] == 0: #put in the start time if there isn't one
        latest_sessions[user]['start'] = timestamp

    if latest_sessions[user]['end'] == 0: #put in the end time if there isn't one
        latest_sessions[user]['end'] = timestamp
    else: #otherwise check whether the new timestamp is more than 20 mins later
        diff = (timestamp - latest_sessions[user]['end']).total_seconds()
        if diff > 1200: #if so, save the finished session to the database
            db.update({'id':user}, {'$push':{'sessions':latest_sessions[user]}})
            latest_sessions[user] = blank_session()
        else:
            latest_sessions[user]['end'] = timestamp #otherwise just replace this end time

    count += 1
    if count % 100000 == 0: #output some nice stats every 100,000 lines
        print str(count) + '/' + str(howManyLinesTotal)

#now push the users' remaining, still-open sessions
for user in latest_sessions:
    db.update({'id':user}, {'$push':{'sessions':latest_sessions[user]}})

I'm getting about 0.002 seconds per line = 44 hours for an 80 million page view file.

This is with a 2 TB 7200 RPM Seagate HDD, 32 GB of RAM and a 3.4 GHz dual-core i3 processor.

Does this time sound reasonable or am I making some horrendous mistakes?


EDIT: We're looking at 90,000+ users, i.e. keys in the defaultdict


EDIT2: Here's the cProfile output on a much smaller 106 MB file. I commented out the actual MongoDB saves for testing purposes: http://pastebin.com/4XGtvYWD


EDIT3: Here's a bar-chart analysis of the cProfile output: https://i.stack.imgur.com/bwMBi.png

LittleBobbyTables

2 Answers

4

I can't tell you where your bottleneck is, but I can tell you how to find it. Python has built-in profiling tools, which will tell you how much time is being spent in every part of your code. Using them on a script is as simple as running:

python -m cProfile my_db_import_script.py

where my_db_import_script.py is the name of your actual script. This command runs the script with the profiler attached. Once the script finishes, it prints out how many times each function was called and how much time was spent inside it, both total and cumulative, along with some other statistics.
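
For example, on Python 2.7 or newer you can ask the profiler to sort its report by cumulative time, or dump the raw stats to a file and dig through them afterwards with the standard pstats module (a quick sketch; import.profile is just a placeholder filename):

python -m cProfile -s cumulative my_db_import_script.py

python -m cProfile -o import.profile my_db_import_script.py
python -c "import pstats; pstats.Stats('import.profile').sort_stats('cumulative').print_stats(20)"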

To use this with your script, you'll want to process a subset of your data that can complete in a reasonable amount of time. From there you'll be able to analyze any bottlenecks that you find.
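
One low-effort way to do that, without making a smaller copy of the log, is to cap the number of lines inside the script itself with itertools.islice. A minimal sketch, assuming the gzipped, tab-delimited log from the question (the filename and the 200,000-line cap are just placeholders):

import gzip
import itertools

f = gzip.open('pageviews.log.gz')

#only look at the first 200,000 lines while profiling; remove the
#islice() wrapper again once the hot spots have been dealt with
for line in itertools.islice(f, 200000):
    line = line.split('\t')
    #... the rest of the per-line session logic from the question ...

f.close()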

The key to optimizing code is to never assume that you know where the problem is. Measure first and measure often.

EDIT:

After glancing through your profile results, these are the lines that stuck out to me:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)

        1    0.000    0.000    0.000    0.000 gzip.py:149(_init_read)
        1    0.000    0.000    0.000    0.000 gzip.py:153(_read_gzip_header)
  2709407    5.543    0.000    8.898    0.000 gzip.py:200(read)
        2    0.000    0.000    0.000    0.000 gzip.py:23(read32)
  2242878    3.267    0.000    3.727    0.000 gzip.py:232(_unread)
   107984    0.266    0.000    3.310    0.000 gzip.py:237(_read)
        1    0.000    0.000    0.000    0.000 gzip.py:26(open)
   107979    0.322    0.000    1.258    0.000 gzip.py:287(_add_read_data)
        1    0.000    0.000    0.000    0.000 gzip.py:293(_read_eof)
        1    0.000    0.000    0.000    0.000 gzip.py:308(close)
        1    0.000    0.000    0.000    0.000 gzip.py:35(GzipFile)
  2242878    8.029    0.000   23.517    0.000 gzip.py:385(readline)
        1    0.000    0.000    0.000    0.000 gzip.py:4(<module>)
        1    0.000    0.000    0.000    0.000 gzip.py:434(__iter__)
  2242878    1.561    0.000   25.078    0.000 gzip.py:437(next)
        1    0.000    0.000    0.000    0.000 gzip.py:44(__init__)

  2242878    2.889    0.000    2.889    0.000 {built-in method utcfromtimestamp}
   107979    1.627    0.000    1.627    0.000 {built-in method decompress}
  2709408    1.451    0.000    1.451    0.000 {method 'find' of 'str' objects}
  2242880    1.849    0.000    1.849    0.000 {method 'split' of 'str' objects}

You'll notice that I've highlighted all of the gzip code. I wasn't super familiar with the gzip module, so I've been looking through the source code. What this module does is expose a normal file-like interface to gzip'd data. There are a couple of ways you could speed this up:

  1. If it's feasible, you could decompress your file beforehand. This would get rid of some of the overhead of gzip.

  2. You can start making optimizations on how your file is read. Here is a link to an example of how things can differ based on how much of a file you read in at once. There are also some good suggestions in this stackoverflow question.
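
One way to get much of the benefit of both suggestions without writing a huge decompressed file to disk is to let an external gzip process do the decompression and stream plain text into Python through a pipe. A rough sketch, assuming a Unix-like machine with zcat on the PATH (the filename is a placeholder):

import subprocess

#let an external gzip process do the decompression and hand the plain
#text to Python through a pipe, avoiding gzip.py's pure-Python
#readline() on every single line
proc = subprocess.Popen(['zcat', 'pageviews.log.gz'], stdout=subprocess.PIPE)

for line in proc.stdout:
    line = line.split('\t')
    #... same per-line session logic as in the question ...

proc.stdout.close()
proc.wait()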

I've also highlighted the fact that converting your timestamp is taking up a fair amount of time, as are your string manipulation functions.
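
If you want to see how much any single per-line operation really costs before rewriting it, timeit makes that easy to measure in isolation. A small sketch comparing the datetime conversion against simply keeping the value as a float (the literal timestamps here are made up):

import timeit

#roughly how expensive is the conversion, per million calls?
print timeit.timeit("datetime.utcfromtimestamp(1301873519.6)",
                    setup="from datetime import datetime",
                    number=1000000)

#versus just parsing the field and comparing raw floats
print timeit.timeit("float('1301873519.639') > 1301872319.6",
                    number=1000000)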

At the end of the day, the best way to go about any optimization on this scale is to run a benchmark, make a change, and re-run. Hope this was instructive!

Wilduck
  • Thanks! :) I've added my cProfile on a smaller script with the actual saves commented out. What do you think? – LittleBobbyTables Feb 06 '13 at 19:06
  • Yep, and I can see why you wouldn't want to paste that all in the question. I'll have a look and see if I can provide a couple pointers. – Wilduck Feb 06 '13 at 19:19
  • hehe :D Here's a bar chart I just generated: http://i.imgur.com/K6pu6xx.png. Perhaps utcfromtimestamp could be eliminated in favor of simply seeing if one float is bigger than another (the times I'm given are Unix timestamps). – LittleBobbyTables Feb 06 '13 at 19:24
2

If I understand the cProfile output correctly, the bottleneck is the gzip stream reader.

The cumtime column ('time spent on the function, including calls to other functions') shows that about half of the run time (25.078 of 45.390) is spent in gzip.py:437(next). Most of that time is spent in gzip.py:385(readline).

It does not look like disk I/O is the bottleneck, though; more like the unpacking logic itself. Try unpacking the file with regular gzip before feeding it to the program. Note that gzip can unpack to stdout; your program could read it from stdin.
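
That could look something like this (a sketch; session_import.py is a placeholder name for your script):

gunzip -c gigantic_file.gz | python session_import.py

and inside session_import.py the loop would read the already-decompressed lines from stdin:

import sys

for line in sys.stdin: #the shell pipeline has already done the decompression
    line = line.split('\t')
    #... per-line session logic ...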

Another function that consumes an inordinate amount of time is utcfromtimestamp. Try reworking that logic if possible.
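
Since the log already contains Unix epoch floats, one way to rework it (as the comment under the other answer also suggests) is to do the 20-minute comparison on the raw floats and only build datetime objects for the sessions that actually get written out. A sketch of that idea; close_session() is a hypothetical helper, and line/latest_sessions/db are the names from the question's loop:

from datetime import datetime

SESSION_GAP = 20 * 60  #20 minutes, in seconds

def close_session(db, user, session):
    #datetime objects are built only for the session being pushed;
    #everywhere else start/end stay as plain epoch floats
    db.update({'id': user},
              {'$push': {'sessions': {
                  'start': datetime.utcfromtimestamp(session['start']),
                  'end': datetime.utcfromtimestamp(session['end']),
                  'page_loads': session['page_loads']}}})

#inside the main loop the comparison becomes plain float arithmetic:
#    ts = float(line[2])
#    if latest_sessions[user]['end'] and ts - latest_sessions[user]['end'] > SESSION_GAP:
#        close_session(db, user, latest_sessions[user])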

Try this: gunzip -c gigantic_file.gz | head -n 100000 > small_unpacked, then feed small_unpacked to your script, opening it as a regular file. Profile again.

9000