My Python App Engine application, which uses the High Replication Datastore, requires a large lookup table of between 100,000 and 1,000,000 entries. I need a function that takes a code and returns the value associated with it (or None if there is no association). For example, if my table held acceptable English words, I would want the function to return True if the word was found and False (or None) otherwise.
My current implementation creates one parentless entity for each table entry, with the entity holding any associated data. I set each entity's key name to the lookup code. (I put all the entities into their own namespace to prevent key conflicts, but that's not essential for this question.) Then I simply call get_by_key_name() with the code and get the associated data back.
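To make the scheme concrete, here is a minimal sketch of the lookup contract. It uses a plain dict as a stand-in for the datastore so that it runs without the App Engine SDK; `LookupTable` and `LOOKUP_NAMESPACE` are hypothetical names for illustration only. In the real application the `get` call corresponds to get_by_key_name() on the model class.

```python
LOOKUP_NAMESPACE = "lookup"  # hypothetical namespace name


class LookupTable(object):
    """Dict-backed stand-in for one parentless entity per table entry."""

    def __init__(self, namespace=LOOKUP_NAMESPACE):
        self._namespace = namespace
        self._entities = {}

    def put(self, code, data):
        # The lookup code itself serves as the entity's key name.
        self._entities[(self._namespace, code)] = data

    def get(self, code):
        # Returns the associated data, or None when there is no association.
        return self._entities.get((self._namespace, code))


words = LookupTable()
for word in ("apple", "banana"):
    words.put(word, True)

known = words.get("apple")    # entry exists
unknown = words.get("aple")   # no entry under this code
```

Here `known` is the stored data and `unknown` is None, which is the behaviour I need from the real table.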
The problem is that I can't access these entities during a transaction because I'd be trying to span entity groups. So going back to my example, let's say I wanted to spell-check all the words used in a chat session. I could access all the messages in the chat because I'd give them a common ancestor, but I couldn't access my word table because the entries there are parentless. It is imperative that I be able to reference the table during transactions.
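The restriction can be illustrated with a toy in-memory model. This is not the App Engine API: `ToyDatastore`, `ToyTransaction` and `CrossGroupError` are hypothetical stand-ins, and the only assumption modelled is that a transaction may touch a single entity group, identified by its root key.

```python
class CrossGroupError(Exception):
    """Raised when one transaction touches a second entity group."""


class ToyTransaction(object):
    """Toy transaction that may touch only one entity group."""

    def __init__(self):
        self.root = None  # root key of the first entity group touched

    def touch(self, root):
        if self.root is None:
            self.root = root
        elif self.root != root:
            raise CrossGroupError(
                "already bound to %r, cannot read from %r" % (self.root, root))


class ToyDatastore(object):
    """Dict-backed stand-in; entities are grouped by their root key."""

    def __init__(self):
        self._entities = {}

    def put(self, key_name, value, parent=None):
        # A parentless entity is the root of its own entity group.
        root = parent if parent is not None else key_name
        self._entities[(root, key_name)] = value

    def get(self, key_name, parent=None, txn=None):
        root = parent if parent is not None else key_name
        if txn is not None:
            txn.touch(root)
        return self._entities.get((root, key_name))


store = ToyDatastore()
store.put("hello", True)                       # parentless word entry
store.put("msg-1", "hello", parent="chat-42")  # message in the chat's group

txn = ToyTransaction()
message = store.get("msg-1", parent="chat-42", txn=txn)  # first group: fine
try:
    store.get(message, txn=txn)  # word entry lives in a different group
    outcome = "found"
except CrossGroupError:
    outcome = "cross-group read rejected"
```

The message read succeeds because it is the first group the transaction touches; the word lookup fails because every parentless word entry is its own group.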
Note that my lookup table is fixed, or changes very rarely. Again this matches the spell-check example.
One solution might be to load all the words in a chat session during one transaction, spell-check them outside the transaction (saving the results), and then start a second transaction that would check against the saved results. But not only would this be inefficient, new messages might also have been added to the chat session between the two transactions. This seems like a clumsy solution.
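A rough sketch of that two-phase workaround, and of the gap it leaves, using plain lists and sets in place of datastore entities. All function names are hypothetical, and the sketch assumes messages are only ever appended to the chat.

```python
def snapshot_messages(chat):
    """Phase 1: copy the messages, as the first transaction would."""
    return list(chat)


def spell_check(messages, dictionary):
    """Between transactions: look every word up outside any transaction."""
    results = {}
    for message in messages:
        for word in message.lower().split():
            results[word] = word in dictionary
    return results


def apply_results(chat, snapshot, results):
    """Phase 2: use the saved results; collect words that arrived too late."""
    unchecked = set()
    for message in chat[len(snapshot):]:  # messages added after phase 1
        for word in message.lower().split():
            if word not in results:
                unchecked.add(word)
    misspelled = sorted(w for w, ok in results.items() if not ok)
    return misspelled, sorted(unchecked)


dictionary = set(["hello", "world"])
chat = ["hello wrld"]

snapshot = snapshot_messages(chat)
results = spell_check(snapshot, dictionary)
chat.append("goodbye world")  # arrives between the two transactions
misspelled, unchecked = apply_results(chat, snapshot, results)
```

After this runs, `misspelled` holds the words from the snapshot that failed the check, while `unchecked` holds words from the late message that the second phase has no saved result for.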
Ideally I'd like to tell GAE that the lookup table is immutable, and that because of this I should be able to query against it without its complaining about spanning entity groups in a transaction. I don't see any way to do this, however.
Storing the table entries in memcache is tempting, but that too has problems. It's a large amount of data, but more troublesome is that if GAE evicts a memcache entry I wouldn't be able to reload it during the transaction, since reloading means reading the parentless entity from the datastore.
Does anyone know of a suitable implementation for large global lookup tables?
Please understand that I'm not looking for a spell-check web service or anything like that. I'm using word lookup as an example only to make this question clear, and I'm hoping for a general solution for any sort of large lookup tables.