
In Google App Engine, one cannot store (as a whole) an object larger than 1 MB in memcache.

Say I want to cache the results of a datastore query consisting of 1000 records of 5 KB each - roughly 5 MB in total.

How should I proceed? Can I cache this data in the Python process of my web application instead of using memcache - for example, in a global variable?
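
For instance, something like this is what I have in mind (Record and the query are just placeholders):

from google.appengine.ext import ndb


class Record(ndb.Model):
    payload = ndb.BlobProperty()


_cached_records = None  # module-level, shared only within one process


def get_records():
    global _cached_records
    if _cached_records is None:
        # ~5 MB in total - too large for a single memcache entry
        _cached_records = Record.query().fetch(1000)
    return _cached_records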

Please find my answer below. Let me know what you think.

1 Answer

Google App Engine may resolve different web requests to different processes or even different physical machines, which makes it harder to maintain global state across requests - that is, to implement local caches of the data.

When the data is modified, you have to be careful to invalidate the local caches on all processes (a cache coherence problem).

Furthermore, if your GAE application is defined as threadsafe, a single process could handle multiple requests at the same time, in different threads.

I sketched a possible solution:

  • keep the data in-process using a global dictionary
  • keep track of the version of the in-process data using a global dictionary
  • keep the gold version of the data in a tiny memcache record (only the version tag, not the actual data, of course)
  • when the in-process local data is stale (invalid), fetch it from the authoritative storage via the value_provider function
  • when appropriate, invalidate the in-process data across all machines (by resetting the gold version tag in memcache).

Here is the code:

import threading
from uuid import uuid4
from google.appengine.api import memcache

_data = dict()
_versions = dict()
lock = threading.Lock()

TIME = 60 * 10  # 10 minutes


def get(key, value_provider):
    """
    Gets a value from the in-process storage (cache).
    If the value is not available in the in-process storage
    or it is invalid (stale), then it is fetched by calling the 'value provider'.
    """
    # Fast check, read-only step (no critical section).
    if _is_valid(key):
        return _data[key]

    # Data is stale (invalid). Perform read+write step (critical section).
    with lock:
        # Check again in case another thread just left the critical section
        # and brought the in-process data to a valid state.
        if _is_valid(key):
            return _data[key]

        version = memcache.get(key)

        # If memcache entry is not initialized
        if not version:
            version = uuid4()
            memcache.set(key, version, time=TIME)

        _data[key] = value_provider()
        _versions[key] = version

    return _data[key]


def _is_valid(key):
    """Whether the in-process data has the latest version (according to memcache entry)."""
    memcache_version = memcache.get(key)
    proc_version = _versions.get(key, None)
    return memcache_version and memcache_version == proc_version


def invalidate(key):
    """Invalidates the in-process cache for all processes."""
    memcache.set(key, uuid4(), time=TIME)
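
For example, usage could look roughly like this (Record, fetch_records and the 'records' key are made-up names for illustration; get and invalidate are the functions above):

from google.appengine.ext import ndb


class Record(ndb.Model):
    payload = ndb.BlobProperty()


def fetch_records():
    # The expensive datastore query; only called when the in-process
    # copy is missing or its version tag is stale.
    return Record.query().fetch(1000)


# Read path: return the in-process copy while it is still valid,
# otherwise re-run the query and store the result in this process.
records = get('records', fetch_records)

# Write path: after modifying the underlying entities, reset the shared
# version tag so every process re-fetches on its next read.
invalidate('records')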

References:

  • https://softwareengineering.stackexchange.com/a/222818
  • Understanding global object persistence in Python WSGI apps
  • Problem declaring global variable in python/GAE
  • Python Threads - Critical Section
  • https://en.wikipedia.org/wiki/Cache_coherence
