
I'm new to web apps, so I'm not used to worrying about CPU limits, but it looks like I'm going to have a problem with this code. I read on Google's quotas page that I can use 6.5 CPU-hours per day and 15 CPU-minutes per minute.

Google said:

CPU time is reported in "seconds," which is equivalent to the number of CPU cycles that can be performed by a 1.2 GHz Intel x86 processor in that amount of time. The actual number of CPU cycles spent varies greatly depending on conditions internal to App Engine, so this number is adjusted for reporting purposes using this processor as a reference measurement.

And

            Per Day          Max Rate
CPU Time    6.5 CPU-hours    15 CPU-minutes/minute

What I want to know:

Is this script going over the limit?

If yes, how can I make it not go over the limit?

I use the urllib library; should I use Google's URL Fetch API? Why?

Absolutely any other helpful comment.

What it does:

It scrapes (crawls) Project Free TV. I will only run it in full once, then replace it with a shorter, faster script.

from urllib import urlopen
import re

# Level 1 (alpha): the movies index page, which links to the category pages.
alphaUrl = 'http://www.free-tv-video-online.me/movies/'
alphaPage = urlopen(alphaUrl).read()
patFinderAlpha = re.compile('<td width="97%" nowrap="true" class="mnlcategorylist"><a href="(.*)">')
findPatAlpha = re.findall(patFinderAlpha, alphaPage)
for ai in range(len(findPatAlpha)):
    # Level 2 (beta): each category page, which links to the show pages.
    betaUrl = 'http://www.free-tv-video-online.me/movies/' + findPatAlpha[ai] + '/'
    betaPage = urlopen(betaUrl).read()
    patFinderBeta = re.compile('<td width="97%" class="mnlcategorylist"><a href="(.*)">')
    findPatBeta = re.findall(patFinderBeta, betaPage)
    for bi in range(len(findPatBeta)):
        # Level 3 (gamma): each show page, which links to the episode pages
        # and carries a keywords meta tag used as the title.
        gammaUrl = betaUrl + findPatBeta[bi]
        gammaPage = urlopen(gammaUrl).read()
        patFinderGamma = re.compile('<a href="(.*)" target="_blank" class="mnllinklist">')
        findPatGamma = re.findall(patFinderGamma, gammaPage)
        patFinderGamma2 = re.compile('<meta name="keywords"content="(.*)">')
        findPatGamma2 = re.findall(patFinderGamma2, gammaPage)
        for gi in range(len(findPatGamma)):
            # Level 4 (delta): each episode page, from which the video iframe
            # source is pulled and stored.
            deltaUrl = findPatGamma[gi]
            deltaPage = urlopen(deltaUrl).read()
            patFinderDelta = re.compile("<iframe id='hmovie' .* src='(.*)' .*></iframe>")
            findPatDelta = re.findall(patFinderDelta, deltaPage)
            PutData(findPatGamma2[gi], findPatAlpha[ai], findPatDelta)  # PutData is defined elsewhere

If I forgot anything please let me know.

Update:

This is roughly how many times each level will run, in case it is helpful in answering the question.

       per cycle      total
Alpha: 1              1
Beta:  16             16
Gamma: ~250           ~4000
Delta: ~6             ~24000
    @Jon parsing HTML with Regex is a sin here on Stack Overflow. – systempuntoout Apr 19 '11 at 08:22
  • These are not "limits", they're what you get for free. If you want to run anything that uses significant CPU, you're going to have to pay for your application, same as with any other host. – Wooble Apr 19 '11 at 11:26
  • @systempuntoout Oops, how should it be done? –  Apr 19 '11 at 14:12
  • Don't worry, I've coded many quick&dirty Python scrapers with regex too. [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is one of the possible ways to properly parse and handle HTML (a minimal sketch follows after these comments). – systempuntoout Apr 19 '11 at 14:17
  • @systempuntoout Why are they so bad? –  Apr 20 '11 at 00:34
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – systempuntoout Apr 20 '11 at 10:49
  • @systempuntoout I'VE BEEN LEARNED, I will now go to a secluded mountain and pray for forgiveness. :) Thanks. –  Apr 20 '11 at 14:20
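
A minimal sketch of the BeautifulSoup approach mentioned in the comments above, assuming BeautifulSoup 3 is available; the class name comes from the regexes in the question:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

alphaUrl = 'http://www.free-tv-video-online.me/movies/'
soup = BeautifulSoup(urlopen(alphaUrl).read())

# Select the category links by their table-cell class instead of a regex.
links = [a['href']
         for td in soup.findAll('td', {'class': 'mnlcategorylist'})
         for a in td.findAll('a', href=True)]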

3 Answers


I don't like to optimize until I need to. First, just try it. It might just work. If you go over quota, shrug, come back tomorrow.

To split jobs into smaller parts, look at the Task Queue API. Maybe you can divide the workload into two queues, one that scrapes pages and one that processes them. You can put limits on the queues to control how aggressively they are run.
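
For illustration, a minimal sketch of that split using the Task Queue API; the handler class, the /tasks/scrape_show route, the 'scrape' queue name, and the extract_links helper are assumptions, not part of the original script:

from google.appengine.api import taskqueue
from google.appengine.ext import webapp

class ScrapeCategory(webapp.RequestHandler):
    def post(self):
        # Each task handles a single category page, then fans out one task
        # per show page it finds, instead of crawling everything in one request.
        url = self.request.get('url')
        # ... fetch and parse `url` here, then enqueue the next level:
        for link in extract_links(url):  # hypothetical parsing helper
            taskqueue.add(url='/tasks/scrape_show',
                          params={'url': link},
                          queue_name='scrape')

The queue's rate can then be capped in queue.yaml (for example rate: 5/s) so the crawl trickles along instead of burning through CPU in one burst.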

P.S. On Regex for HTML: Do what works. The academics will call you out on semantic correctness, but if it works for you, don't let that stop you.

Justin Morgan

I use the urllib library; should I use Google's URL Fetch API? Why?

urllib on App Engine production servers is the URLFetch API under the hood, so you are effectively already using it.
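
If you want explicit control over timeouts or response codes, you can also call it directly; a minimal sketch, raising the deadline from the default 5 seconds:

from google.appengine.api import urlfetch

# Roughly equivalent to urlopen(url).read(), with an explicit 10-second deadline.
result = urlfetch.fetch('http://www.free-tv-video-online.me/movies/', deadline=10)
if result.status_code == 200:
    page = result.content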

Chris Farmiloe

It's unlikely that this will go over the free limit, but it's impossible to say without seeing how big the list of URLs it needs to fetch is, and how big the resulting pages are. The only way to know for sure is to run it - and there's really no harm in doing that.

You're more likely to run into the limitations on individual request execution - 30 seconds for frontend requests, 10 minutes for backend requests like cron jobs - than run out of quota. To alleviate those issues, use the Task Queue API to split your job into many parts. As an additional benefit, they can run in parallel! You might also want to look into Asynchronous URLFetch - though it's probably not worth it if this is just a one-off script.
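
For reference, a minimal sketch of the asynchronous URLFetch pattern (the URL list here is just a placeholder):

from google.appengine.api import urlfetch

urls = ['http://www.free-tv-video-online.me/movies/a/',
        'http://www.free-tv-video-online.me/movies/b/']

# Start all the fetches first, then collect the results, so they overlap in time.
rpcs = []
for url in urls:
    rpc = urlfetch.create_rpc(deadline=10)
    urlfetch.make_fetch_call(rpc, url)
    rpcs.append(rpc)

pages = [rpc.get_result().content for rpc in rpcs]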

Nick Johnson