In my Python GAE app, the following snippet of code is MUCH slower in production than when run locally. The processing goes like this:
- A text file of about 1 MB is uploaded in a POST request. Each line of the text file is an "item".
- My code creates a list of items from the text file and checks for duplicates and validity (by comparing against a compiled RE).
Here is the code:
    def process_items(self, text):
        item_list = text.split()
        item_set = set()
        n_valid = 0
        n_invalid = 0
        n_dups = 0
        out = ""
        for item in item_list:
            if item in item_set:
                n_dups += 1
                out += "DUPLICATE: %s\n" % item
            elif valid_item(item):  # This compares against a compiled RE
                item_set.add(item)
                n_valid += 1
                out += "%s\n" % item
            else:
                n_invalid += 1
                out += "INVALID: %s\n" % item
        return out
When I run this on the local dev server, a 1MB file of 50,000 lines takes 5 seconds to process.
When I run this in production, the same file takes over a minute and the request times out. The file upload only takes about a second, so I know the bottleneck is the above code.
In the past, production code was about the same speed as my local code. I don't think this code has changed, so I suspect there may have been a change on Google's end.
Any idea why this code is now much slower in production? Anything I can do to make this code faster? I need to return an annotated file to the user that indicates which lines are duplicates and which lines are invalid.
EDIT:
In response to mgilson's comment, I tried the following code, and it made a huge difference in execution time! The processing that previously timed out after a minute now takes only about 5 seconds. GAE is still slower than expected (even accounting for the relatively slow server CPUs), but with the improved algorithm, it doesn't matter for me now.
    def process_items(self, text):
        item_list = text.split()
        item_set = set()
        n_valid = 0
        n_invalid = 0
        n_dups = 0
        for i, item in enumerate(item_list):
            item = item.strip()
            if item in item_set:
                n_dups += 1
                item_list[i] = "DUPLICATE: %s" % item
            elif valid_item(item):  # This compares against a compiled RE
                item_set.add(item)
                n_valid += 1
                item_list[i] = item
            else:
                n_invalid += 1
                item_list[i] = "INVALID: %s" % item
        return "\n".join(item_list)
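For anyone who wants to try this outside my app, here is a rough, self-contained sketch of the same idea: collect the output lines in a list and join them once at the end, instead of growing a string with `+=` inside the loop. Repeated `+=` on a long string can end up copying the whole string on each iteration, which is probably what hurt in production, though I have not confirmed that. The regex and the `VALID_ITEM_RE` name are made up for illustration (my real pattern is not shown here), and returning the counts in a dict is just a convenience for this sketch.

```python
import re

# Hypothetical pattern for illustration only; the real RE is not shown in the post.
VALID_ITEM_RE = re.compile(r"^[A-Za-z0-9_-]+$")


def valid_item(item):
    """Return True if the item matches the (assumed) validity pattern."""
    return VALID_ITEM_RE.match(item) is not None


def process_items(text):
    """Annotate each whitespace-separated item as valid, duplicate, or invalid."""
    seen = set()      # valid items encountered so far
    out_lines = []    # output lines, joined once at the end
    counts = {"valid": 0, "invalid": 0, "dups": 0}

    for item in text.split():
        if item in seen:
            counts["dups"] += 1
            out_lines.append("DUPLICATE: %s" % item)
        elif valid_item(item):
            seen.add(item)
            counts["valid"] += 1
            out_lines.append(item)
        else:
            counts["invalid"] += 1
            out_lines.append("INVALID: %s" % item)

    # A single join builds the result in one pass over the collected lines.
    return "\n".join(out_lines), counts
```

Calling `process_items(text)` returns the annotated text together with the counts. One difference from my first version: the joined output has no trailing newline after the last line.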