
My Python High Replication Datastore application requires a large lookup table of between 100,000 and 1,000,000 entries. I need to be able to supply a code to some method that will return the value associated with that code (or None if there is no association). For example, if my table held acceptable English words then I would want the function to return True if the word was found and False (or None) otherwise.

My current implementation is to create one parentless entity for each table entry, and for that entity to contain any associated data. I set the datastore key for that entity to be the same as my lookup code. (I put all the entities into their own namespace to prevent any key conflicts, but that's not essential for this question.) Then I simply call get_by_key_name() on the code and I get the associated data.
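In rough outline, that implementation looks like this (a minimal sketch using the google.appengine.ext.db API; the kind name LookupEntry and the value property are placeholders):

```python
from google.appengine.ext import db

class LookupEntry(db.Model):
    value = db.TextProperty()  # whatever data is associated with the code

def add_entry(code, data):
    # The lookup code doubles as the datastore key name.
    LookupEntry(key_name=code, value=data).put()

def lookup(code):
    entry = LookupEntry.get_by_key_name(code)
    return entry.value if entry is not None else None
```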

The problem is that I can't access these entities during a transaction because I'd be trying to span entity groups. So going back to my example, let's say I wanted to spell-check all the words used in a chat session. I could access all the messages in the chat because I'd give them a common ancestor, but I couldn't access my word table because the entries there are parentless. It is imperative that I be able to reference the table during transactions.

Note that my lookup table is fixed, or changes very rarely. Again this matches the spell-check example.

One solution might be to load all the words in a chat session during one transaction, spell-check them (saving the results), and then start a second transaction that works against the saved results. But not only would this be inefficient, new messages might be added to the chat session between the two transactions. This seems like a clumsy solution.

Ideally I'd like to tell GAE that the lookup table is immutable, and that because of this I should be able to query against it without its complaining about spanning entity groups in a transaction. I don't see any way to do this, however.

Storing the table entries in memcache is tempting, but that too has problems. It's a large amount of data, but more troublesome is that if GAE evicts a memcache entry I wouldn't be able to reload it during the transaction.

Does anyone know of a suitable implementation for large global lookup tables?

Please understand that I'm not looking for a spell-check web service or anything like that. I'm using word lookup as an example only to make this question clear, and I'm hoping for a general solution for any sort of large lookup tables.

Dragonfly

2 Answers


First, if you're under the belief that a namespace is going to help avoid key collisions, it's time to take a step back. A key consists of an entity kind, a namespace, a name or id, and any parents that the entity might have. It's perfectly valid for two different entity kinds to have the same name or id. So if you have, say, a LookupThingy that you're matching against, and have created each member by specifying a unique name, the key isn't going to collide with anything else.
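A quick illustration of the point about keys (the kind names here are just examples): two entities of different kinds can share the same name without colliding, because the kind is part of the key.

```python
from google.appengine.ext import db

k1 = db.Key.from_path('LookupThingy', 'aardvark')
k2 = db.Key.from_path('ChatMessage', 'aardvark')
assert k1 != k2  # same name, different kinds: no collision
```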

As for the challenge of doing the equivalent of a spell-check against an unparented lookup table within a transaction, is it possible to keep the lookup table in code?
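For a table of words, "in code" could be as simple as a module-level set shipped with the app (a sketch only; as the comments below note, a very large table may run into the process size limit):

```python
# lookup_table.py -- deployed alongside the application code
WORDS = frozenset(['aardvark', 'abacus', 'abandon'])  # ...and so on

def is_word(word):
    return word in WORDS
```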

Or can you think of an analogy that's closer to what you need? One that motivates the need to do the lookup within a transaction?

Dave W. Smith
  • Yes, you must be right that the separate namespace doesn't add any protection. I had thought this might avoid key collisions, but the kind should do that adequately even within the default namespace. Another example: Say that you wanted to associate data with each ZIP Code; for example, you might have average income figures or other demographics. How would you be able to consult this table from within a transaction? This seems like an extremely general problem. A million entries hard-wired into Python code? I suppose that would work, but it seems ungainly. Is that the only solution? – Dragonfly Sep 17 '11 at 04:43
  • Why does matching against a lookup table have to be done within a transaction? Are you trying to isolate against infrequent changes to the lookup table? – Dave W. Smith Sep 17 '11 at 04:55
  • There’s not room enough here to explain fully the need for transactions, but it has nothing to do with the lookup table. It’s for the usual reason: I am making changes to an entity group and the data must be treated atomically. I experimented with representing all the table data in Python, and this exceeded the amount of space allowed for code ("soft process size limit"). I guess this was to be expected. The question remains unanswered. – Dragonfly Sep 17 '11 at 22:42
  • What prevents you from querying against the lookup table before you begin the transaction? – Dave W. Smith Sep 19 '11 at 04:48
  • Because I don't have the data at that point. I believe the correct way to use transactions is: start the transaction, do your reads, do your computation, do your writes, end the transaction. If I read before the start of the transaction then I leave an opening for another process to alter the data before my transaction starts. – Dragonfly Sep 19 '11 at 21:21
  • Ah, got it. You need to lock down the entity that you're querying the join table on behalf of. One approach is to version your entities, bumping the version on each mutation. The result of the "join" would then record the version at the time the join was done. This lets you do everything (except mutation) outside of a transaction. – Dave W. Smith Sep 19 '11 at 22:20

If you can, try and fit the data into instance memory. If it won't fit in instance memory, you have a few options available to you.
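Because module-level state survives across requests for the lifetime of an instance (see the comments below), an in-memory table can be loaded lazily, once per instance. A minimal sketch, where load_table() is a hypothetical loader:

```python
_table = None  # module-level cache; persists for the life of the instance

def get_table():
    global _table
    if _table is None:
        _table = load_table()  # hypothetical: read from a resource file, etc.
    return _table
```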

You can store the data in a resource file that you upload with the app, if it only changes infrequently, and access it off disk. This assumes you can build a data structure that permits easy disk lookups - effectively, you're implementing your own read-only disk-based table.
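For instance, if the table were shipped as a file of fixed-width records sorted by key, a binary search over the file gives O(log n) lookups. A sketch under those assumptions (the record width, file name, and key:value format are placeholders; code must be a bytes key):

```python
RECORD_WIDTH = 32  # each line padded to a fixed width, newline included

def lookup_on_disk(code, path='lookup_table.dat'):
    with open(path, 'rb') as f:
        f.seek(0, 2)  # seek to end to compute the record count
        lo, hi = 0, f.tell() // RECORD_WIDTH
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_WIDTH)
            key, _, value = f.read(RECORD_WIDTH).rstrip().partition(b':')
            if key == code:
                return value
            elif key < code:
                lo = mid + 1
            else:
                hi = mid
    return None
```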

Likewise, if it's too big to fit as a static resource, you could take the same approach as above, but store the data in blobstore.
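Blobstore offers file-like, seekable access through BlobReader, so the same disk-style lookup would work there too (the blob key would be recorded when the table is uploaded):

```python
from google.appengine.ext import blobstore

def open_table(blob_key):
    # BlobReader supports read()/seek()/tell(), like an ordinary file.
    return blobstore.BlobReader(blob_key)
```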

If your data absolutely must be in the datastore, you may need to emulate your own read-modify-write transactions. Add a 'revision' property to your records. To modify it, fetch the record (outside a transaction), perform the required changes, then inside a transaction, fetch it again to check the revision value. If it hasn't changed, increment the revision on your own record and store it to the datastore.
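A sketch of that emulated read-modify-write, using the db API (the model and property names are illustrative):

```python
from google.appengine.ext import db

class Record(db.Model):
    data = db.TextProperty()
    revision = db.IntegerProperty(default=0)

def modify(key_name, new_data):
    snapshot = Record.get_by_key_name(key_name)  # read outside the transaction
    # ...compute new_data from snapshot here...

    def txn():
        current = Record.get_by_key_name(key_name)
        if current.revision != snapshot.revision:
            raise db.Rollback()  # someone else changed it; retry from the top
        current.data = new_data
        current.revision += 1
        current.put()

    db.run_in_transaction(txn)
```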

Note that the underlying RPC layer does theoretically support multiple independent transactions (and non-transactional operations), but the APIs don't currently expose any way to access this from within a transaction, short of horrible (and I mean really horrible) hacks, unfortunately.

One final option: You could run a backend provisioned with more memory, exposing a 'SpellCheckService', and make URLFetch calls to it from your frontends. Remember, in-memory is always going to be much, much faster than any disk-based option.
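Sketching the frontend side of such a call (the backend name 'spellcheck', the /check path, and the response format are all assumptions about a service you'd write yourself):

```python
import urllib

from google.appengine.api import backends, urlfetch

def spell_check(word):
    host = backends.get_hostname('spellcheck')
    url = 'http://%s/check?word=%s' % (host, urllib.quote(word))
    result = urlfetch.fetch(url)
    return result.status_code == 200 and result.content == 'true'
```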

Nick Johnson
  • +1 for suggesting putting it in memory. For fixed data this is likely to be optimal. – SingleNegationElimination Sep 19 '11 at 04:36
  • With regard to your first suggestion, do I understand correctly that I could keep the table in the datastore, then read the entire table into instance memory before beginning my transaction? That would work, but the table occupies something like 3MB and I will require only a handful of table lookups. Thus, I'd be reading in 3MB when all I require are a few hundred bytes, which sounds inefficient. Or are you suggesting something else? There's no way for instance memory to survive requests (except by writing to the memcache) is there? (And thank you for your reply -- on a Sunday night no less!) – Dragonfly Sep 19 '11 at 21:51
  • As for your other ideas, I understand the "revision property" idea and could go with that. I haven't yet learned about blobstore or backends so I'll research those next. I like the sound of using a backend; are there any disadvantages to that solution? P.S. I still think having GAE support immutable datastore entities that can be accessed by any transaction is a good idea! – Dragonfly Sep 19 '11 at 21:58
  • @DragonFly Instances persist for multiple requests, and as a result, so do all global and class variables (anything that isn't request-scoped, basically). You can read the data in on startup, and refer to it repeatedly for as long as the instance persists. – Nick Johnson Sep 20 '11 at 00:04
  • Ah! I've learned something new. This is excellent news. I'd give you a +1 but I'm too new to SO for that. You have my thanks, however. – Dragonfly Sep 20 '11 at 02:03
  • Following your suggestion, my program now keeps my table of data in an instance variable, and it works fine. As I understand it, if GAE decides to create multiple instances of my application then there will be multiple instance variables independently storing the table data. But here's where I have trouble. Let's say that once a year my data change, which means that my instances need to have new data. How can I push the update to all instances? Is there a way to, say, set a variable to None across all instances? Or to terminate all instances? – Dragonfly Jan 03 '12 at 07:50
  • @DragonFly Just deploy a new version of your app (even if nothing has changed code-wise) - that will cause a new set of instances to be started with the new version, replacing the old ones. – Nick Johnson Jan 03 '12 at 09:58
  • Hmmm... on reflection I think this will not quite work for my situation. The person who updates the table won't have access to the code and won't be able to deploy. What I really need is a programmatic solution such that after the data in the table are modified an "Update" button can be clicked and the new data will propagate to all instances. Or all instances are restarted. Is there any way to trigger this within Python/GAE code? – Dragonfly Jan 03 '12 at 10:44
  • @Dragonfly No. All you can do is cache the data instead of storing it permanently, and re-fetch it when it expires. – Nick Johnson Jan 03 '12 at 11:40
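A sketch of that expiring cache (load_table() and the one-hour TTL are placeholders):

```python
import time

_cache = None
_loaded_at = 0
TABLE_TTL = 3600  # re-fetch at most once an hour

def get_table():
    global _cache, _loaded_at
    if _cache is None or time.time() - _loaded_at > TABLE_TTL:
        _cache = load_table()  # hypothetical: read from datastore/blobstore
        _loaded_at = time.time()
    return _cache
```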