
I am working on a system that will run on GAE and will have several related entities, and I am not sure of the best way to store the data. This post is a request for advice from others who may have similar experience...

The system will have users, with profile data and an image. Those users will be able to create "events" and add journal entries to them. For the purposes of the system, an "event" will likely have 1 or 2 journal entries in it, and anything over 10 would likely never happen. Other users will be able to add comments to users' entries as well, where popular ones may have hundreds or even thousands of comments. When a random visitor uses the system, they should be able to see the latest events (latest being defined by those with the latest journal entries in them), search by tag, and perform a very basic text search. Then, upon selecting an event to view, it should be displayed with all journal entries and all user comments, with user images alongside the comments. A user should also have a kind of self-admin page, to view/modify/delete their events and to view/modify/delete comments they have made on other events. Doing all this on a normal RDBMS would just require queries with some big joins across several tables. On GAE it would obviously need to work differently. Here are my initial thoughts on the design of the entities (with a rough code sketch after the list):

  • Event entity - id, name, timestamp, list property of tags, view count, creator's username, creator's profile image id, number of journal entries it contains, number of total comments it contains, timestamp of last update to contained journal entries, list property of index words for search (built/updated from text from contained journal entries)
  • JournalEntry entity - timestamp, journal text, name of event, creator's username, creator's profile image id, list property of comments (containing commenter username and image id)
  • User entity - username, password hash, email, list property of subscribed events, timestamp of create date, image id, number of comments posted, number of events created, number of journal entries created, timestamp of last journal activity
  • UserComment entity - username, id of event commented on, title of event commented on
  • TagData entity - tag name, count of events with tag on them
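
To make the list above concrete, here is a rough sketch of those entities in the Python `google.appengine.ext.db` API. Property names and types are illustrative only, not a final design:

```python
# Rough sketch only -- property names and types are illustrative.
from google.appengine.ext import db

class Event(db.Model):
    name = db.StringProperty(required=True)
    created = db.DateTimeProperty(auto_now_add=True)
    tags = db.StringListProperty()
    view_count = db.IntegerProperty(default=0)
    creator_username = db.StringProperty()
    creator_image_id = db.StringProperty()
    journal_entry_count = db.IntegerProperty(default=0)
    comment_count = db.IntegerProperty(default=0)
    last_entry_update = db.DateTimeProperty()
    index_words = db.StringListProperty()  # built/updated from journal entry text

class JournalEntry(db.Model):
    created = db.DateTimeProperty(auto_now_add=True)
    text = db.TextProperty()
    event_name = db.StringProperty()
    creator_username = db.StringProperty()
    creator_image_id = db.StringProperty()
    comments = db.StringListProperty()  # e.g. "username|image_id|comment text"

class User(db.Model):
    username = db.StringProperty(required=True)
    password_hash = db.StringProperty()
    email = db.EmailProperty()
    subscribed_events = db.ListProperty(db.Key)
    created = db.DateTimeProperty(auto_now_add=True)
    image_id = db.StringProperty()
    comment_count = db.IntegerProperty(default=0)
    event_count = db.IntegerProperty(default=0)
    journal_entry_count = db.IntegerProperty(default=0)
    last_journal_activity = db.DateTimeProperty()

class UserComment(db.Model):
    username = db.StringProperty()
    event_id = db.IntegerProperty()
    event_title = db.StringProperty()

class TagData(db.Model):
    tag = db.StringProperty()
    event_count = db.IntegerProperty(default=0)
```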

So, I'd like to hear what people here think about the design and what changes should be made to help it scale well. Thanks!

user605331

1 Answer

  • Rather than store Event.id as a property, use the id automatically embedded in each entity's key, or set unique key names on entities as you create them.
  • You have lots of options for modeling the relationship between Event and JournalEntry: you could use a ReferenceProperty, you could parent JournalEntries to Events and retrieve them with ancestor queries, or you could store a list of JournalEntry key ids or names on Event and retrieve them in batch with a key query. Try some things out with realistically-distributed dummy data, and use appstats to see what works best. (A rough sketch of the key_name/ancestor options follows this list.)
  • UserComment references an Event, while JournalEntry references a list of UserComments, which is a little confusing. Is there a relationship between UserComment and JournalEntry, or just between UserComment and Event?
  • Persisting so many counts is expensive. When I post a comment, you're going to write a new UserComment entity and also update my User entity and a JournalEntry entity and an Event entity. The number of UserComments you expect per Event makes it unwise to include everything in the same entity group, which means you can't do these writes transactionally, so you'll do them serially, and the entities might be stored across different network nodes, making the whole operation slow; and you'll also be open to consistency problems. Can you do without some of these counts and consider storing others in memcache?
  • When you fetch an Event from the datastore, you don't actually care about its list of search index words, and retrieving and deserializing them from protocol buffers has a cost. You can get around this by splitting each Event's search index words into a separate child EventIndex entity. Then you can query EventIndex on your search term, fetch just the EventIndex keys for EventIndexes that match your search, derive the corresponding Events' keys with key.parent(), and fetch the Events by key, never paying for the retrieval or deserialization of your search index word lists. Brett Slatkin explains this strategy here at 14:35. (Sketched below, after this list.)
  • Updating Event.viewCount will fail if you have a lot of views for any Event in rapid succession, so you should try out counter sharding (a bare-bones sketch follows this list).
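
Here's a minimal sketch of the first two points, reusing the model names from the rough sketch in the question. The key_name and the ancestor approach are just one of the options mentioned above, not a prescription:

```python
from google.appengine.ext import db

# No separate id property: either let the datastore assign a numeric id
# (read it back later as event.key().id()), or pick a unique key_name
# yourself when you create the entity.
event = Event(key_name='pyrenees-trip-2011', name='Pyrenees trip')
event.put()

# One of the relationship options: parent each JournalEntry to its Event
# and pull them back with an ancestor query.
entry = JournalEntry(parent=event, text='Day one: rained all day.')
entry.put()

entries = JournalEntry.all().ancestor(event).order('created').fetch(20)
```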
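For the relation-index idea, a sketch along these lines keeps the word lists out of every Event fetch (it assumes the index words are stored as lower-cased single words):

```python
from google.appengine.ext import db

class EventIndex(db.Model):
    # Child of an Event; holds only the search words so the Event itself
    # never has to load or deserialize them.
    index_words = db.StringListProperty()

def search_events(term, limit=20):
    # Keys-only query: the matching EventIndex entities are never fetched
    # or deserialized, only their keys.
    index_keys = (EventIndex.all(keys_only=True)
                  .filter('index_words =', term.lower())
                  .fetch(limit))
    # Each EventIndex key's parent is the Event we actually want.
    return db.get([key.parent() for key in index_keys])
```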
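And for the view counter, a bare-bones sharded counter could look like this (the shard count and model name are made up for the example; the App Engine docs have a fuller recipe):

```python
import random

from google.appengine.ext import db

NUM_SHARDS = 20  # more shards = more concurrent increments before contention

class ViewCountShard(db.Model):
    event_key_name = db.StringProperty(required=True)
    count = db.IntegerProperty(default=0)

def increment_views(event_key_name):
    # Pick a random shard so concurrent views rarely hit the same entity.
    shard_name = '%s-%d' % (event_key_name, random.randint(0, NUM_SHARDS - 1))

    def txn():
        shard = ViewCountShard.get_by_key_name(shard_name)
        if shard is None:
            shard = ViewCountShard(key_name=shard_name,
                                   event_key_name=event_key_name)
        shard.count += 1
        shard.put()

    db.run_in_transaction(txn)

def get_views(event_key_name):
    # Summing the shards is a read-time cost; cache the result if it's hot.
    q = ViewCountShard.all().filter('event_key_name =', event_key_name)
    return sum(shard.count for shard in q)
```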

Good luck, and tell us what you learn by trying stuff out.

dfichter
  • dfichter, thanks for the terrific advice, it is exactly what I was hoping to get. To answer your question about the counts, I'm sure I can manage without them, as they are mostly there just to help users see where the activity is. Still though, it's a bit of a bummer to have to give them up. You're right though, having to create that many separate transactions to get a single operation done is excessive. I'd be happy to put it all in memcache, but then I can't count on it really being there, since that could go away at any time. Is this a common problem others have solved? – user605331 Feb 06 '11 at 20:52
  • ...I'm marking dfichter's response as "the answer" although of course there is no real answer to my request for advice. – user605331 Feb 06 '11 at 20:56
  • @user605331 sure, the pattern is: when you need the value, check for it in memcache, and if it's not there, compute it from scratch, cache it, and use it; and when you do a datastore write that would affect the value, update the value in memcache. Computing counts from scratch is expensive but hopefully won't happen often, and there's guidance about it [here](http://stackoverflow.com/questions/421751/whats-the-best-way-to-count-results-in-gql/4918133#4918133). A rough sketch of the pattern is below, after the comments. – dfichter Feb 07 '11 at 05:04
  • @user605331 Alternatively, you could persist your counts with a non-blocking dispatch to a [task queue](http://code.google.com/appengine/docs/python/taskqueue/), which would make your writes faster but still count against your quotas. I was also thinking some of your counts, like the count of `JournalEntries` for each `Event`, don't even need to be stored in memcache, much less the datastore, because if you're always retrieving `JournalEntries` from the datastore along with their `Event`, you can just count them on the fly. (The task-queue variant is also sketched below.) – dfichter Feb 07 '11 at 05:05
  • I was thinking of keeping the Event entity separate from the JournalEntry entity. If Event had a list prop of JournalEntries, then they would need to be deserialized each time I wanted to display a list of Events. If they were together, I would be able to use the count, but that seems like a high price to pay for it. I do like the idea of using the task queue to update the various entities, though. The counts do not need to be updated in real-time, so that would help keep the system running fast while still showing useful information. – user605331 Feb 07 '11 at 19:16
  • For using memcache and recalculating each time (vs storing counts) - the cost to recalculate would be extremely high, since I would have to iterate over all entities (of a given type) to do it. As I understand it, there is no guarantee that any data will stay in memcache, so I can't be sure how often I would have to run the recount process. Obviously it would be a bummer if it ran often. – user605331 Feb 07 '11 at 19:20
  • Not all entities of a kind, just the relevant ones, like just the events authored by your user, if that's the count you need to display, right? But yeah, even that is costly if the count is high, and it's hard to know how frequently your counts might be dropped from memcache. If you go that route, you might want to render the page without the count and fire off an ajax request to get it. Cost of possibly re-counting vs. cost of definitely hitting a task queue and persisting more data, go with your gut, I guess. – dfichter Feb 08 '11 at 03:52
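
A minimal sketch of the check-memcache-or-recompute pattern dfichter describes above, assuming the `UserComment.event_id` property from the rough model sketch in the question:

```python
from google.appengine.api import memcache

def comment_count(event_id):
    cache_key = 'comment-count:%d' % event_id
    count = memcache.get(cache_key)
    if count is None:
        # Expensive fallback: recount only this event's comments, then cache.
        count = (UserComment.all(keys_only=True)
                 .filter('event_id =', event_id)
                 .count(10000))
        memcache.set(cache_key, count)
    return count

def on_comment_posted(event_id):
    # incr is a no-op if the key was evicted; the next read just recomputes.
    memcache.incr('comment-count:%d' % event_id)
```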
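And a sketch of the task-queue variant from the comments. The URL, handler, and counter field are assumptions for illustration, and the handler still has to be mapped to the URL in the app's routing:

```python
from google.appengine.api import taskqueue
from google.appengine.ext import db, webapp

def enqueue_count_update(event_key_name):
    # Non-blocking dispatch: the user-facing request returns right away and
    # the counter bookkeeping happens later in a push-queue worker.
    taskqueue.add(url='/tasks/update-counts',
                  params={'event': event_key_name})

class UpdateCountsHandler(webapp.RequestHandler):
    """Handles /tasks/update-counts; App Engine retries it on failure."""

    def post(self):
        def txn():
            event = Event.get_by_key_name(self.request.get('event'))
            if event is not None:
                event.comment_count += 1
                event.put()
        db.run_in_transaction(txn)
```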