I need to bulk-load all entities in a table. (They need to be in memory rather than loaded as-needed, for high-speed on-demand graph-traversal algorithms.)

I need to parallelize the loading for speed, so I want to run multiple queries in parallel threads, each pulling approximately 800 entities from the database.

QuerySplitter serves this purpose, but we are running on the Flexible Environment and so are using the App Engine SDK rather than the client libraries.

MapReduce has been mentioned, but it is not aimed at simply loading data into memory. Memcache is somewhat relevant, but for high-speed access I need all these objects as a dense network in the RAM of my own app's JVM.

MultiQueryBuilder might do this: it offers parallelism, running parts of a query in parallel.

Whichever of these three approaches, or some other approach, is used, the hardest part is defining filters or some other form of splits that roughly partition the table (the Kind) into chunks of about 800 entities each. I would like to create filters that say "entities 1 through 800," "entities 801 through 1600," and so on, but I know that is impractical. So, how does one do it?

Joshua Fox

1 Answer

I solved a similar problem by partitioning the entities into random groups.

I added a float property to each datastore entity and assigned it a random number between 0 and 1 every time I saved the entity. Then, when launching the N threads to work on the datastore entities, I had each thread query 1/N of the entities. For example, thread 0 would handle all entities whose random property fell between 0 and 1/N, thread 1 would handle those between 1/N and 2/N, and so on.

The downside is that it is not entirely deterministic and you need to add a new property to your datastore entities. The upside is that it easily scales to millions of entities and many threads, and you generally get an even distribution of work across the threads.
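A minimal sketch of this idea, using the App Engine low-level Datastore API. The Kind name `MyKind`, the property name `partition`, and the class and method names are illustrative assumptions, not from the original post:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.CompositeFilterOperator;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RandomPartitionLoader {

    private static final String KIND = "MyKind";           // hypothetical Kind
    private static final String PARTITION_PROP = "partition"; // hypothetical property

    /** Call on every save so each entity carries a random value in [0, 1). */
    public static void stampPartition(Entity e) {
        e.setProperty(PARTITION_PROP, Math.random());
    }

    /** Loads all entities of KIND with nThreads parallel range queries over the partition property. */
    public static List<Entity> loadAll(int nThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Future<List<Entity>>> futures = new ArrayList<>();
        for (int i = 0; i < nThreads; i++) {
            final double lo = (double) i / nThreads;
            final double hi = (double) (i + 1) / nThreads;
            futures.add(pool.submit(new Callable<List<Entity>>() {
                @Override
                public List<Entity> call() {
                    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
                    // Range query: lo <= partition < hi selects roughly 1/nThreads of the Kind.
                    Query q = new Query(KIND).setFilter(CompositeFilterOperator.and(
                            new FilterPredicate(PARTITION_PROP, FilterOperator.GREATER_THAN_OR_EQUAL, lo),
                            new FilterPredicate(PARTITION_PROP, FilterOperator.LESS_THAN, hi)));
                    List<Entity> chunk = new ArrayList<>();
                    for (Entity e : ds.prepare(q).asIterable(FetchOptions.Builder.withChunkSize(500))) {
                        chunk.add(e);
                    }
                    return chunk;
                }
            }));
        }
        List<Entity> all = new ArrayList<>();
        for (Future<List<Entity>> f : futures) {
            all.addAll(f.get());
        }
        pool.shutdown();
        return all;
    }
}
```

To aim for chunks of roughly 800 entities, pick nThreads as the (approximate) total entity count divided by 800; the random property only needs to be indexed, which single properties are by default.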

speedplane
  • Thank you. That could work. It seems strange that we have to add an otherwise useless field for functionality that is standard in many databases. I wonder if there is a builtin way to do this? – Joshua Fox May 25 '16 at 12:21
  • Yes, when I first considered it (someone else suggested it to me), it also felt strange. However, consider that the Google database is designed to be distributed and handle a nearly infinite amount of data. On the google database, you cannot easily determine the total number of elements in a table (there is no `len` function). If Google implemented a partitioning algorithm, it would need to keep track of the number of elements in the table. Given this constraint and others, the random field actually makes a good bit of sense. – speedplane May 25 '16 at 15:21
  • One more thing: If you want it to be deterministic, you could also use a hash function on the entity (i.e., turn each entity into a semi-random number). However, you'll need to design the hash function so it's evenly distributed across 0 to 1 (or whatever range you're using). – speedplane May 25 '16 at 15:24
  • Thank you. Actually, QuerySplitter does determine splits, using some randomness although in a different way. Your random number approach also makes sense, but you would think that Google could implement it internally. – Joshua Fox May 26 '16 at 05:40
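Following up on the hashing suggestion in the comments, a minimal sketch of a deterministic variant (class and method names are illustrative): derive a stable value in [0, 1) from each entity's key instead of storing a fresh random number. Note that to filter on it, the value would still have to be persisted as an indexed property, since Datastore can only query stored properties.

```java
import com.google.appengine.api.datastore.Entity;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DeterministicPartition {

    /**
     * Maps an entity's key to a value in [0, 1) that is stable across saves.
     * SHA-256 spreads the keys roughly uniformly, so range filters over this
     * value split the Kind into evenly sized chunks.
     */
    public static double partitionValue(Entity e) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(e.getKey().toString().getBytes(StandardCharsets.UTF_8));
            // Use the first 6 bytes (48 bits), which fit exactly in a double's mantissa.
            long bits = 0;
            for (int i = 0; i < 6; i++) {
                bits = (bits << 8) | (digest[i] & 0xFF);
            }
            return (double) bits / (1L << 48);
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```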