21

I have written a Google App Engine application that programmatically generates a bunch of HTML that is really the same output for each user who logs into my system, and I know that this is going to be inefficient when the code goes into production. So, I am trying to figure out the best way to cache the generated pages.

The most probable option is to generate the pages and write them into the database, and then compare the time of the database put operation for a given page against the time that the code was last updated. If the code is newer than the last put to the database (for a particular HTML request), new HTML will be generated, served, and cached to the database. If the code is older than the last put, I will just get the HTML directly from the database and serve it (thereby avoiding the CPU cost of generating the HTML). I am looking to minimize not only load times but also CPU usage.

However, one issue that I am having is that I can't figure out how to programmatically check when the version of code uploaded to App Engine was last updated.

I am open to any suggestions on this approach, or other approaches for caching generated html.

Note that while memcache could help in this situation, I believe that it is not the final solution since I really only need to re-generate html when the code is updated (as opposed to every time the memcache expires).

tshepang
Alexander Marquardt
  • 5
    But... that's what memcache is for! Unless generating the HTML takes a really, really long time, you're overthinking it. – Jonathan Feinberg Dec 18 '09 at 22:16
  • Also, it seems that the App Engine memcache model only caches datastore accesses, not code generation: from: http://code.google.com/appengine/docs/python/memcache/usingmemcache.html -- Memcache is typically used with the following pattern: The application receives a query from the user or the application. The application checks whether the data needed to satisfy that query is in memcache. If the data is in memcache, the application uses that data. If the data is not in memcache, the application queries the datastore and stores the results in memcache for future requests. – Alexander Marquardt Dec 18 '09 at 22:28
  • 2
    @Alexander - You can put whatever you want in memcache; what they are mentioning is the typical use case. – Gab Royer Dec 19 '09 at 01:17
  • 1
    +1 - Good question. I can't believe (continually) the lack of any sort of upvotes on questions. A good question is as valuable as a good answer since the question is usually how folks start searching for the answers... – Mat Nadrofsky Dec 21 '09 at 15:52
  • I had another use case for memcache: I read a .yaml settings file on every request. This turned out to be too heavy (file locks or something, opening of files, and the request often timed out after 60 seconds!), so I just used memcache for caching the contents of the settings file. The pattern is the same as Alexander Marquardt describes. – Jonny May 16 '14 at 04:01

5 Answers

6

In order of speed:

  1. memcache
  2. cached HTML in data store
  3. full page generation

Your caching solution should take this into account. Essentially, I would recommend using memcache anyway. It will be faster than accessing the data store in most cases, and when you're generating a large block of HTML, one of the main benefits of caching is that you potentially avoid the I/O penalty of accessing the data store. If you cache using the data store, you still pay that I/O penalty. The difference between regenerating everything and pulling cached HTML from the data store is likely to be fairly small unless you have a very complex page. It's probably better to get a bunch of very fast cache hits off memcache and do a full regenerate every once in a while than to make a call out to the data store every time. There's nothing stopping you from invalidating the cached HTML in memcache when you update, and if your traffic is high enough to warrant it, you can always build a multi-level caching system.
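One way to combine the two ideas above (invalidate on update, and a multi-level cache) is to put the deployed code version into the cache key, so that a new deploy simply misses the old entries. The following is a minimal Python 3 sketch under stated assumptions, not App Engine code: `fast_cache` and `slow_store` are plain dicts standing in for memcache and the data store, `render_page` is a hypothetical generator, and the version is read from the `CURRENT_VERSION_ID` environment variable, which you should verify for your runtime.

```python
import os

# Stand-ins for memcache and the data store (assumptions, not App Engine APIs).
fast_cache = {}
slow_store = {}
render_calls = []  # records each full generation, for illustration only


def code_version():
    # Assumed: the runtime exposes the deployed version in this variable.
    # Falls back to "dev" when it is not set.
    return os.environ.get("CURRENT_VERSION_ID", "dev")


def render_page(path):
    # Hypothetical expensive HTML generation.
    render_calls.append(path)
    return "<html><body>Page for %s</body></html>" % path


def get_page(path):
    # The version is part of the key, so a new deploy misses old entries.
    key = "html:%s:%s" % (code_version(), path)
    html = fast_cache.get(key)
    if html is not None:
        return html  # fastest tier
    html = slow_store.get(key)
    if html is None:
        html = render_page(path)  # slowest path: full generation
        slow_store[key] = html
    fast_cache[key] = html  # repopulate the fast tier
    return html
```

With this layout, an eviction from the fast tier costs one read from the slow tier rather than a full regeneration, and stale entries from older versions are never read again (they still occupy storage until removed).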

However, my main concern is that this is premature optimization. If you don't have the traffic yet, keep caching to a minimum. App Engine provides a set of really convenient performance analysis tools, and you should be using those to identify bottlenecks after you've got at least a few QPS of traffic.

Anytime you're doing performance optimization, measure first! A lot of performance "optimizations" turn out to either be slower than the original, exactly the same, or they have negative user experience characteristics (like stale data). Don't optimize until you're certain you have to.
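As a rough illustration of measuring before optimizing, here is a small Python 3 sketch that times a page generation against a cache lookup on a local machine. `render_page` and the dict-based `cache` are hypothetical stand-ins, and local timings say little about production, so treat this as a starting point rather than a benchmark.

```python
import time


def render_page():
    # Hypothetical stand-in for the real page generation.
    return "".join("<li>item %d</li>" % i for i in range(1000))


def average_seconds(func, repeats=100):
    # Average wall-clock time of one call over several runs.
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats


cache = {"page": render_page()}
generate_time = average_seconds(render_page)
cached_time = average_seconds(lambda: cache["page"])
print("generate: %.6fs  cached: %.6fs" % (generate_time, cached_time))
```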

Bob Aman
  • Hi Bob, Thanks for your feedback! Will take into account your suggestions! – Alexander Marquardt Dec 19 '09 at 01:41
  • Definitely benchmark memcache if you want to put largish amounts of data in it. On App Engine it pickles everything with a pure Python pickle implementation, IIRC; this *could* end up pretty slow. – tosh Dec 20 '09 at 10:57
  • If the object he's inserting is raw pre-rendered HTML, that shouldn't matter. – Bob Aman Dec 21 '09 at 18:35
5

A while ago I wrote a series of blog posts about writing a blogging system on App Engine. You may find the post on static generation of HTML pages of particular interest.

Nick Johnson
1

Just serve a static version of your site

It's actually a lot easier than you think.

If you already have a file that contains all of the urls for your site (ex urls.py), half the work is already done.

Here's the structure:

+-/website
+--/static
+---/html
+--/app/urls.py
+--/app/routes.py
+-/deploy.py

/html is where the static files will be served from. urls.py contains a list of all the urls for your site. routes.py (if you moved the routes out of main.py) will need to be modified so you can see the dynamically generated version locally but serve the static version in production. deploy.py is your one-stop static site generator.

How you lay out your urls module depends on your needs. I personally use it as a one-stop shop to fetch all the metadata for a page, but YMMV.

Example:

main = [
  { 'uri':'about-us', 'url':'/', 'template':'about-us.html', 'title':'About Us' }
]

With all of the urls for the site in a structured format it makes crawling your own site easy as pie.

The route configuration is a little more complicated. I won't go into detail because there are just too many different ways this could be accomplished. The important piece is the code required to detect whether you're running on a development or production server.

Here it is:

import os
# Detect whether this is the 'Development' server
DEV = os.environ['SERVER_SOFTWARE'].startswith('Dev')

I prefer to put this in main.py and expose it globally because I use it to turn on/off other things like logging but, once again, YMMV.
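To make the idea concrete, here is one possible way the flag could drive the routing decision, written as a Python 3 sketch. The `resolve` helper and its return values are hypothetical and only illustrate the branch; how you wire it into your framework's routes is up to you.

```python
import os


def is_dev_server(environ=None):
    # Same check as above, but tolerant of a missing variable.
    environ = os.environ if environ is None else environ
    return environ.get("SERVER_SOFTWARE", "").startswith("Dev")


def resolve(page, dev):
    # Hypothetical helper: decide how a page entry from urls.py is served.
    if dev:
        # Development: render the template dynamically.
        return ("render", page["template"])
    # Production: point at the pre-generated file under static/html.
    return ("static", "static/html/" + page["template"])
```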

Last, you need the crawler/compiler:

# Python 2 script: run it while the development server is listening on `port`.
import os
import urllib2
from app.urls import main

port = '8080'
local_folder = os.path.join(os.getcwd(), 'static', 'html')
print 'Outputting to: ' + local_folder

print '\nCompiling:'
for page in main:
  # Fetch the dynamically rendered page from the local development server
  http = urllib2.urlopen('http://localhost:' + port + page['url'])
  file_name = page['template']
  path = os.path.join(local_folder, file_name)
  # Save the rendered HTML under static/html, named after the template
  local_file = open(path, 'w')
  local_file.write(http.read())
  local_file.close()
  http.close()
  print ' - ' + file_name + ' compiled successfully...'

This is really rudimentary stuff. I was actually stunned with how easy it was when I created it. This is literally the equivalent of opening your site page-by-page in the browser, saving as html, and copying that file into the /static/html folder.

The best part is, the /html folder works like any other static folder so it will automatically be cached and the cache expiration will be the same as all the rest of your static files.

Note: This handles a site where the pages are all served from the root folder level. If you need deeper nesting of folders it'll need a slight modification to handle that.
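For nested folders, one possible modification is to derive the output path from the URL and create the intermediate directories before writing. This is a Python 3 sketch under my own assumptions: the `output_path` and `save_page` helpers are hypothetical, and naming files after the URL (with `/` mapped to `index.html`) is a choice made for illustration, not what the script above does.

```python
import os


def output_path(root, url):
    # Map a URL such as '/blog/2012/post' to root/blog/2012/post.html,
    # and '/' to root/index.html (this naming scheme is an assumption).
    parts = [p for p in url.split("/") if p]
    if not parts:
        parts = ["index"]
    parts[-1] += ".html"
    return os.path.join(root, *parts)


def save_page(root, url, html):
    path = output_path(root, url)
    # Create any missing intermediate folders before writing the file.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as handle:
        handle.write(html)
    return path
```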

Evan Plaice
  • Hi Evan. While useful, your answer makes potentially incorrect assumptions about Alex's environment. If he's building a content management system where admins can edit the content, independent of the application, then the use of static files is obviously not going to solve this problem. Consider avoiding answers that don't actually answer the question. Some community members will downvote those. If you know how to cache generated HTML, I encourage you to elaborate. Still, I think this is useful information, especially the steps to convert generated HTML into static HTML via the browser source. – jamesmortensen May 11 '12 at 15:55
  • @jmort253 Actually, it's completely possible to edit the content and display it dynamically on the development server. Ideally, pushing content to the production server involves a 1-click compile/deploy cycle. The dynamic routes (ie, editing/admin facilities) and dynamic site files are available on the development server whereas, the production server points to the static (ie static/html) version of the site. On the production side, everything remains static so everything is automatically cached. – Evan Plaice May 11 '12 at 16:26
  • (cont) This may not be a standard/traditional approach but it works. I'm currently using it for 2 sites where the development server is GAE and the production is Apache (ie using regex routes to set the correct file structure). The same can be done for an all-in-one GAE production/development setup and the version numbers will remain in sync if the compiler is set to fire as a GAE deployment hook. See https://developers.google.com/appengine/articles/hooks. The best way to minimize CPU cycles is to never launch dynamic code in production... – Evan Plaice May 11 '12 at 16:36
  • (cont) BTW, all static files are automatically cached. You just need to set the default_expiration parameter in app.yaml. See http://stackoverflow.com/questions/2642432/google-app-engine-how-to-disable-cache-on-static-files-or-make-cache-smart. – Evan Plaice May 11 '12 at 16:50
  • Hi Evan. I'm not 100% clear. Not because of your explanation but because of my inexperience with Python. Are you running a production site from the development server? Or are you serving static files on Apache that then have JS code that makes AJAX requests to a GAE server? Also, thank you for sending the link to the hooks. I'll need to read that a few times to really soak it in and understand how that all applies here. I'm currently looking for a dynamic caching solution on GAE myself, but on the Java SDK; however, the concepts may of course apply there as well. – jamesmortensen May 11 '12 at 17:20
  • @jmort253 Here's the simplest explanation I can think of. Create a full dynamic site in GAE. Crawl the site using Python and download the public pages (i.e. not editing/admin) as .html files into the static/html folder. Upload the static folder to the Apache server using FTP. Modify the routes on the Apache server (preferably in httpd.conf) to match the file structure used in the GAE version. Dynamic stuff (i.e. comments) *can* be handled by a combination of SaaS and AJAX/JSONP requests. Save on CPU cycles by doing the work on the client. – Evan Plaice May 11 '12 at 19:10
  • (cont) The GAE/GAE workflow is the same, you just setup a set of alternate routes that serve up the static .html files on the deployed instances. – Evan Plaice May 11 '12 at 19:17
  • So if I get what you're saying, your GAE app uploads dynamically generated files to an Apache server via FTP as static HTML files? So your admins can still edit content, but the saved content is stored on the Apache server for fast retrieval? If that's what you're suggesting, then that's an awesome, creative solution. +1 Thank you for taking the time to explain this further, I really appreciate it. You should put that in your answer at the top of your answer in bold, in case the mods come through and remove the comments ;) I think you might lose people in the first line, like you did me :) – jamesmortensen May 11 '12 at 19:23
  • Yes except GAE doesn't deploy/upload the files to the Apache server on its own. I do that myself locally with a deploy.py script found in the root of the project. Deploy.py includes the html compiler code above and a FTP uploader that transfers the contents of my static folder to the Apache server. – Evan Plaice May 17 '12 at 18:44
1

Old thread, but I'll comment anyway as technology has progressed a little... Another idea that may or may not be appropriate for you is to generate the HTML and store it on Google Cloud Storage, then access the HTML via the CDN link that Cloud Storage provides for you. No need to check memcache or wait for the datastore to wake up on new requests. I've started storing all my JavaScript, CSS, and other static content (images, downloads, etc.) like this for my App Engine apps and it's working well for me.

Joe Bourne
  • 1
    To add to this, ingress/egress between cloud services within the same region is free: https://cloud.google.com/storage/pricing . So, as long as you're not allowing the user to access it directly you're just paying for the storage. This might be preferable if you don't want to waste any of your memcache bandwidth on your static assets. – Dustin Oprea Aug 27 '16 at 03:22
1

This is not a complete solution, but it might offer an interesting option for caching.

Google App Engine Frontend Caching gives you a way of caching without using memcache.
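Frontend or proxy caches generally act on the response's `Cache-Control` header, so a handler that wants its output cached would mark it as public with a lifetime. Here is a minimal Python 3 WSGI sketch of that idea; the `cached_app` handler and the one-hour lifetime are illustrative assumptions, and whether App Engine's frontend honours the header for your app is something to verify rather than take from this example.

```python
def cached_app(environ, start_response):
    # Hypothetical WSGI handler; the page body is a placeholder.
    body = b"<html><body>Same page for every user</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Assumption: a public max-age lets an intermediate cache reuse the
        # response for up to an hour instead of calling the app again.
        ("Cache-Control", "public, max-age=3600"),
    ]
    start_response("200 OK", headers)
    return [body]
```

This only suits pages that really are identical for every user, as in the question; anything user-specific should not be marked public.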

Albert