
I'm new to web apps, so I'm not used to worrying about CPU limits, but it looks like I'm going to have a problem with this code. I read on Google's quotas page that I can use 6.5 CPU-hours per day and 15 CPU-minutes per minute.

Google said:

CPU time is reported in "seconds," which is equivalent to the number of CPU cycles that can be performed by a 1.2 GHz Intel x86 processor in that amount of time. The actual number of CPU cycles spent varies greatly depending on conditions internal to App Engine, so this number is adjusted for reporting purposes using this processor as a reference measurement.

And

            Per Day          Max Rate
CPU Time    6.5 CPU-hours    15 CPU-minutes/minute

What I want to know:

Is this script going over the limit?

If yes, how can I make it not go over the limit?

I use the urllib library; should I use Google's URL Fetch API? Why?

Absolutely any other helpful comment.

What it does:

It scrapes (crawls) Project Free TV. I will only run it in full once, then replace it with a shorter, faster script.

from urllib import urlopen
import re

# Level 1 (alpha): the movies index page, which links to the category pages.
alphaUrl = 'http://www.free-tv-video-online.me/movies/'
alphaPage = urlopen(alphaUrl).read()
patFinderAlpha = re.compile('<td width="97%" nowrap="true" class="mnlcategorylist"><a href="(.*)">')
findPatAlpha = re.findall(patFinderAlpha, alphaPage)
for ai in range(len(findPatAlpha)):
    # Level 2 (beta): each category page, which links to the show pages.
    betaUrl = 'http://www.free-tv-video-online.me/movies/' + findPatAlpha[ai] + '/'
    betaPage = urlopen(betaUrl).read()
    patFinderBeta = re.compile('<td width="97%" class="mnlcategorylist"><a href="(.*)">')
    findPatBeta = re.findall(patFinderBeta, betaPage)
    for bi in range(len(findPatBeta)):
        # Level 3 (gamma): each show page, which links to the episode pages
        # and carries a keywords meta tag used as the title.
        gammaUrl = betaUrl + findPatBeta[bi]
        gammaPage = urlopen(gammaUrl).read()
        patFinderGamma = re.compile('<a href="(.*)" target="_blank" class="mnllinklist">')
        findPatGamma = re.findall(patFinderGamma, gammaPage)
        patFinderGamma2 = re.compile('<meta name="keywords"content="(.*)">')
        findPatGamma2 = re.findall(patFinderGamma2, gammaPage)
        for gi in range(len(findPatGamma)):
            # Level 4 (delta): each episode page, from which the video iframe
            # source is pulled and stored.
            deltaUrl = findPatGamma[gi]
            deltaPage = urlopen(deltaUrl).read()
            patFinderDelta = re.compile("<iframe id='hmovie' .* src='(.*)' .*></iframe>")
            findPatDelta = re.findall(patFinderDelta, deltaPage)
            PutData(findPatGamma2[gi], findPatAlpha[ai], findPatDelta)  # PutData is defined elsewhere

If I forgot anything please let me know.

Update:

This is roughly how many times each level will run, in case it is helpful in answering the question.

       per cycle      total
Alpha: 1              1
Beta:  16             16
Gamma: ~250           ~4000
Delta: ~6             ~24000
    @Jon parsing HTML with Regex is a sin here on Stack Overflow. – systempuntoout Apr 19 '11 at 08:22
  • These are not "limits", they're what you get for free. If you want to run anything that uses significant CPU, you're going to have to pay for your application, same as with any other host. – Wooble Apr 19 '11 at 11:26
  • @systempuntoout Oops, how should it be done? –  Apr 19 '11 at 14:12
  • Don't worry, I've coded many quick&dirty Python scrapers with regex too. [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is one of the possible ways to properly parse and handle HTML (a minimal sketch follows after these comments). – systempuntoout Apr 19 '11 at 14:17
  • @systempuntoout Why are they so bad? –  Apr 20 '11 at 00:34
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – systempuntoout Apr 20 '11 at 10:49
  • @systempuntoout I'VE BEEN LEARNED, I will now go to a secluded mountain and pray for forgiveness. :) Thanks. –  Apr 20 '11 at 14:20
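
A minimal sketch of the BeautifulSoup approach mentioned in the comments above, assuming BeautifulSoup 3 is available; the class name comes from the regexes in the question:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

alphaUrl = 'http://www.free-tv-video-online.me/movies/'
soup = BeautifulSoup(urlopen(alphaUrl).read())

# Select the category links by their table-cell class instead of a regex.
links = [a['href']
         for td in soup.findAll('td', {'class': 'mnlcategorylist'})
         for a in td.findAll('a', href=True)]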

3 Answers


I don't like to optimize until I need to. First, just try it. It might just work. If you go over quota, shrug, come back tomorrow.

To split jobs into smaller parts, look at the Task Queue API. Maybe you can divide the workload into two queues, one that scrapes pages and one that processes them. You can put limits on the queues to control how aggressively they are run.
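
For illustration, a minimal sketch of that split using the Task Queue API; the handler class, the /tasks/scrape_show route, the 'scrape' queue name, and the extract_links helper are assumptions, not part of the original script:

from google.appengine.api import taskqueue
from google.appengine.ext import webapp

class ScrapeCategory(webapp.RequestHandler):
    def post(self):
        # Each task handles a single category page, then fans out one task
        # per show page it finds, instead of crawling everything in one request.
        url = self.request.get('url')
        # ... fetch and parse `url` here, then enqueue the next level:
        for link in extract_links(url):  # hypothetical parsing helper
            taskqueue.add(url='/tasks/scrape_show',
                          params={'url': link},
                          queue_name='scrape')

The queue's rate can then be capped in queue.yaml (for example rate: 5/s) so the crawl trickles along instead of burning through CPU in one burst.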

P.S. On Regex for HTML: Do what works. The academics will call you out on semantic correctness, but if it works for you, don't let that stop you.

Justin Morgan

I use the urllib library; should I use Google's URL Fetch API? Why?

urllib on App Engine production servers is the URLFetch API under the hood, so you are effectively already using it.
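
If you want explicit control over timeouts or response codes, you can also call it directly; a minimal sketch, raising the deadline from the default 5 seconds:

from google.appengine.api import urlfetch

# Roughly equivalent to urlopen(url).read(), with an explicit 10-second deadline.
result = urlfetch.fetch('http://www.free-tv-video-online.me/movies/', deadline=10)
if result.status_code == 200:
    page = result.content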

Chris Farmiloe

It's unlikely that this will go over the free limit, but it's impossible to say without seeing how big the list of URLs it needs to fetch is, and how big the resulting pages are. The only way to know for sure is to run it - and there's really no harm in doing that.

You're more likely to run into the limitations on individual request execution - 30 seconds for frontend requests, 10 minutes for backend requests like cron jobs - than run out of quota. To alleviate those issues, use the Task Queue API to split your job into many parts. As an additional benefit, they can run in parallel! You might also want to look into Asynchronous URLFetch - though it's probably not worth it if this is just a one-off script.
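
For reference, a minimal sketch of the asynchronous URLFetch pattern (the URL list here is just a placeholder):

from google.appengine.api import urlfetch

urls = ['http://www.free-tv-video-online.me/movies/a/',
        'http://www.free-tv-video-online.me/movies/b/']

# Start all the fetches first, then collect the results, so they overlap in time.
rpcs = []
for url in urls:
    rpc = urlfetch.create_rpc(deadline=10)
    urlfetch.make_fetch_call(rpc, url)
    rpcs.append(rpc)

pages = [rpc.get_result().content for rpc in rpcs]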

Nick Johnson