GAE Python code MUCH slower in production than locally

Question

In my Python GAE app, the following snippet of code is MUCH slower in production than when run locally. The processing goes like this:

A text file of about 1 MB is loaded in a POST. Each line of the text file is an "item".
My code creates a list of items from the text file and checks for duplicates and validity (by comparing against a compiled RE).

Here is the code:

def process_items(self, text):
    item_list = text.split()
    item_set = set()
    n_valid = 0
    n_invalid = 0
    n_dups = 0
    out = ""
    for item in item_list:
        if item in item_set:
            n_dups += 1
            out += "DUPLICATE: %s\n" % item
        elif valid_item(item): # This compares against a compiled RE
            item_set.add(item)
            n_valid += 1
            out += "%s\n" % item
        else:
            n_invalid += 1
            out += "INVALID: %s\n" % item
    return out

When I run this on the local dev server, a 1MB file of 50,000 lines takes 5 seconds to process.

When I run this in production, the same file takes over a minute and the request times out. The file upload only takes about a second, so I know the bottleneck is the above code.

In the past, production code was about the same speed as my local code. I don't think this code has changed, so I suspect there may have been a change on Google's end.

Any idea why this code is now much slower in production? Anything I can do to make this code faster? I need to return an annotated file to the user that indicates which lines are duplicates and which lines are invalid.

EDIT:

In response to mgilson's comment, I tried the following code, and it made a huge difference in execution time! The processing that previously timed out after a minute now takes only about 5 seconds. GAE is still slower than expected (even accounting for the relatively slow server CPUs), but with the improved algorithm, it doesn't matter for me now.

def process_items(self, text):
    item_list = text.split()
    item_set = set()
    n_valid = 0
    n_invalid = 0
    n_dups = 0
    for i, item in enumerate(item_list):
        item = item.strip()
        if item in item_set:
            n_dups += 1
            item_list[i] = "DUPLICATE: %s" % item
        elif valid_item(item): # This compares against a compiled RE
            item_set.add(item)
            n_valid += 1
            item_list[i] = item
        else:
            n_invalid += 1
            item_list[i] = "INVALID: %s" % item
    return "\n".join(item_list)
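
To check how much of the time goes to string building on a given machine, a small timing harness along these lines could be used. This is only a sketch: `build_with_concat`, `build_with_join`, and the synthetic 50,000-item input are made up for illustration, and the numbers will vary with the interpreter and hardware (CPython can sometimes optimize `+=` in place, so the gap may be small locally).

```python
import timeit

def build_with_concat(items):
    # Repeated += may copy the growing string on each iteration.
    out = ""
    for item in items:
        out += "%s\n" % item
    return out

def build_with_join(items):
    # Build all pieces first, then concatenate once.
    return "".join("%s\n" % item for item in items)

# Synthetic stand-in for the 50,000-line upload (assumed data).
items = ["item%d" % i for i in range(50000)]

# Both strategies must produce identical output.
assert build_with_concat(items) == build_with_join(items)

t_concat = timeit.timeit(lambda: build_with_concat(items), number=5)
t_join = timeit.timeit(lambda: build_with_join(items), number=5)
print("+=:   %.3fs" % t_concat)
print("join: %.3fs" % t_join)
```

Running the same harness inside a request handler in production would show whether the slowdown there comes from the concatenation or from something else.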

Comparing local running and running on GAE isn't really fair. Depending on your set [instance class](https://cloud.google.com/appengine/docs/about-the-standard-environment#instance_classes), you could have a CPU limit as low as 600MHz. Most personal computers are _significantly_ faster than that now. One immediately obvious optimization that _might_ help is to accumulate the results in a list and `return "".join(results)` at the end rather than using `+=`. See [Why is ''.join() faster than += in Python?](http://stackoverflow.com/q/39312099/748858) for example ... — mgilson, Sep 14 '16 at 15:50
Making `process_items` a generator `yielding` one line at a time would also speed up its overall processing time by obviating the slow `+=` — Craig Burgler, Sep 14 '16 at 16:00
@mgilson, My Mac is 2.2 GHz so it is 3.7x faster than GAE. For this Python code, my Mac is at least 12x faster than GAE. That still seems like a big discrepancy though I realize that many factors make it an imperfect comparison. — new name, Sep 14 '16 at 17:57
Is the time taken consistent from run to run? Also, is an instance already running, or do you have startup time in the mix? Try creating a new App Engine instance and test your performance there. You may be consistently on a node that is being hammered by other services. This is a straight-out CPU task. — Tim Hoffman, Sep 14 '16 at 22:10
@TimHoffman, I do it a few times in a row on an instance used only by me so startup time is not a factor. It has been consistent between yesterday and today and over several subsequent runs each time. — new name, Sep 15 '16 at 11:55

score 4 · Accepted Answer · answered Sep 15 '16 at 17:15

It's not at all unexpected that GAE production would run slower than your local machine. Depending on your instance class, your production CPU can be throttled as low as 600MHz, which is significantly slower than most developer computers.

One thing you can do to speed things up is to accumulate your results in a list (or yield them from a generator) and then use str.join to get the full result:

def process_items(self, text):
    item_list = text.split()
    item_set = set()
    n_valid = 0
    n_invalid = 0
    n_dups = 0
    out = []
    for item in item_list:
        if item in item_set:
            n_dups += 1
            out.append("DUPLICATE: %s\n" % item)
        elif valid_item(item): # This compares against a compiled RE
            item_set.add(item)
            n_valid += 1
            out.append("%s\n" % item)
        else:
            n_invalid += 1
            out.append("INVALID: %s\n" % item)
    return "".join(out)
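
The generator variant mentioned above might look roughly like this. It is a sketch under assumptions: the question does not show `valid_item`, so the regex here is a hypothetical stand-in, and `iter_annotated` is a name invented for illustration. The unused counters are dropped and the functions are written at module level for brevity.

```python
import re

# Hypothetical pattern; the real valid_item() is not shown in the question.
_ITEM_RE = re.compile(r"^[A-Za-z0-9_-]+$")

def valid_item(item):
    # Stand-in for the asker's compiled-RE check.
    return _ITEM_RE.match(item) is not None

def iter_annotated(text):
    # Yield one annotated line per item instead of growing a string.
    seen = set()
    for item in text.split():
        if item in seen:
            yield "DUPLICATE: %s\n" % item
        elif valid_item(item):
            seen.add(item)
            yield "%s\n" % item
        else:
            yield "INVALID: %s\n" % item

def process_items(text):
    # Join once at the end; a caller could also stream the lines directly.
    return "".join(iter_annotated(text))
```

For example, `process_items("a b a c!")` marks the second `a` as a duplicate and `c!` as invalid under the assumed pattern.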
