
I have a requirement to send notifications to users a few days before a certain deadline. My current design is to use Cloud Scheduler to invoke a job periodically; the job then looks for all records whose deadline falls within a small interval in the future and alerts those users. Everything works fine except that I need to create an index on the timestamp, which is not a good idea for scaling, since indexing monotonically increasing values can create hotspots. So what is the best scalable design choice for this valid business requirement? Note that I am already familiar with prefixing the timestamps with some value so that ordering is tracked within a partition. However, the scheduled job needs to discover all the deadlines, and it is simply not practical to iterate over each partition, because over time the number of upcoming deadlines can be much smaller than the number of partitions.
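
The job is roughly the following (a minimal sketch using the Python Datastore client; the Reminder kind, the deadline property and the notify_user() helper are placeholder names):

    # Sketch of the current design: a Cloud Scheduler-invoked job that queries for
    # deadlines falling inside a small window a few days from now and alerts those users.
    from datetime import datetime, timedelta, timezone

    from google.cloud import datastore

    client = datastore.Client()

    def scan_upcoming_deadlines(days_ahead=3, window_minutes=10):
        window_start = datetime.now(timezone.utc) + timedelta(days=days_ahead)
        window_end = window_start + timedelta(minutes=window_minutes)

        query = client.query(kind="Reminder")
        query.add_filter("deadline", ">=", window_start)  # needs the index on the timestamp
        query.add_filter("deadline", "<", window_end)
        for entity in query.fetch():
            notify_user(entity)  # placeholder for the actual alerting logic

    def notify_user(entity):
        print("notifying", entity.key)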

Siva
  • What database technology are you using? Reading here ... https://cloud.google.com/bigtable/docs/schema-design-time-series ... it seems Google's Bigtable database supports time-series data. I'm not a data specialist, but I wonder whether there is actually a problem here. I get the sense that you are running only a couple of queries a day. How many inserts (new users) happen per day (or per hour)? – Kolban Sep 28 '19 at 16:52
  • I am using Cloud Datastore – Siva Sep 28 '19 at 18:26

2 Answers


IMHO your particular usage context is not as sensitive to the hotspotting problem as you think.

The problem really only applies when you have a very high read/write rate to a narrow range of lexicographically close entries - in your case, the timestamp-based notification schedule index.

The writes to the index happen when you modify the schedules. As you mentioned in the comments this is not a big concern as you can easily control the rate of such updates.

Even if you occasionally have widespread schedule updates, I'd also submit that index-update delay spikes due to contention on those occasions aren't much of a concern: you mentioned the notifications are sent days ahead of the deadlines - who cares if they are delayed by a few minutes? Also, on the read side you can use keys_only queries followed by direct key lookups to eliminate the eventual consistency (which would otherwise be sensitive to contention/index update delays).
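
For illustration, the keys_only-then-lookup pattern could look like this (Python client; the Notification kind and deadline property are placeholder names):

    # keys_only query (reads only the index) followed by strongly consistent lookups by key,
    # so any stale index entries only affect which keys come back, not the entity data used.
    from google.cloud import datastore

    client = datastore.Client()

    def due_notifications(window_start, window_end):
        query = client.query(kind="Notification")
        query.add_filter("deadline", ">=", window_start)
        query.add_filter("deadline", "<", window_end)
        query.keys_only()
        keys = [entity.key for entity in query.fetch()]
        return client.get_multi(keys)  # direct key lookups are strongly consistent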

As for the read side - you typically have a single read operation on the index at any one time: the query obtaining the list of notifications to be sent for one interval. Once the job obtains the list, the index doesn't normally need to be accessed anymore: the job just schedules the subsequent jobs responsible for sending the notifications in the list (which may access the respective notification entities, but not the index itself!). I'm assuming, of course, that you schedule these querying jobs comfortably apart from each other.

Even if you use cursors to handle cases in which the number of notifications returned by the query is too high to process in one shot (each cursor use would need to re-create the original query context, most likely accessing the index again), you can stagger such operations to limit the number of concurrent ones or to make them entirely non-overlapping. See How to delete all the entries from google datastore? for an example of such staggering using GAE deferred tasks (a similar approach is feasible using the now available, more generic Cloud Tasks, which support scheduled times in the future).
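
A rough illustration of that staggering idea with Cloud Tasks (the queue name, the /send-batch endpoint, the batch size and the spacing are all arbitrary choices for the sketch):

    # Split the work into batches and schedule each batch at a progressively later time,
    # so the actual processing rate stays bounded. `items` is assumed to be a list of
    # JSON-serializable IDs (e.g. url-safe entity keys).
    import json
    from datetime import datetime, timedelta, timezone

    from google.cloud import tasks_v2
    from google.protobuf import timestamp_pb2

    tasks_client = tasks_v2.CloudTasksClient()
    parent = tasks_client.queue_path("my-project", "us-central1", "notifications")

    def stagger_batches(items, batch_size=100, spacing_seconds=30):
        for batch_index, start in enumerate(range(0, len(items), batch_size)):
            batch = items[start:start + batch_size]
            run_at = datetime.now(timezone.utc) + timedelta(seconds=batch_index * spacing_seconds)
            schedule_time = timestamp_pb2.Timestamp()
            schedule_time.FromDatetime(run_at)
            tasks_client.create_task(
                parent=parent,
                task={
                    "http_request": {
                        "http_method": tasks_v2.HttpMethod.POST,
                        "url": "https://example.com/send-batch",  # hypothetical handler
                        "body": json.dumps(batch).encode(),
                    },
                    "schedule_time": schedule_time,
                },
            )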

Dan Cornilescu
  • Dan, all your points are valid. My concern is at the time of initially inserting the transactions. I set the deadline on the entities at insert time, and all inserts at a given instant will have the same future deadline. So if there is a high burst of transaction creation, then I will have the problem "Creates new entities with an indexed and monotonically increasing property value, like a timestamp, at a very high rate." – Siva Sep 30 '19 at 14:53
  • You're not gonna consume the result of those transactions right away, so there shouldn't be a problem staggering those transactions: just split the todo items in batches and schedule tasks with different progressive delays to process each batch at progressively later moments, thus limiting the actual inserting transaction rates. – Dan Cornilescu Oct 01 '19 at 02:52

Your design is good. In Datastore (or any other database system) you have to index the next trigger date (I recommend storing it as a timestamp).

When your scheduler runs your job (a function, Cloud Run or something else), the job performs a query where now() > trigger timestamp. For each document found, it simply publishes a message to Pub/Sub (just the ID, or the whole document, as you prefer); see the sketch after the bullet below.

  • Here Pub/Sub helps to avoid the hotspot and acts as a processing buffer
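
Something along these lines, for illustration (Python; the Notification kind, the trigger_timestamp property and the topic name are assumptions):

    # Scheduler-triggered step: find entities whose trigger time has passed and publish
    # their keys to Pub/Sub so the actual sending can happen in parallel downstream.
    from datetime import datetime, timezone

    from google.cloud import datastore, pubsub_v1

    datastore_client = datastore.Client()
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "notification-triggers")

    def enqueue_due_notifications(request=None):
        now = datetime.now(timezone.utc)
        query = datastore_client.query(kind="Notification")
        query.add_filter("trigger_timestamp", "<=", now)
        query.keys_only()
        futures = [
            publisher.publish(topic_path, data=entity.key.to_legacy_urlsafe())
            for entity in query.fetch()
        ]
        for future in futures:
            future.result()  # wait until Pub/Sub has accepted every message
        return "ok"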

Set up a push subscription on your Pub/Sub topic to process each message and actually send the notification. After sending, update the next trigger timestamp according to your use case and specs.
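
One possible shape for that push endpoint, assuming the message carries the url-safe entity key as in the sketch above (Flask on Cloud Run; the 30-day bump is just an example):

    # Pub/Sub push endpoint: decode the pushed message, send the notification, then move
    # the trigger timestamp forward so the next scheduler run skips this entity.
    import base64
    from datetime import datetime, timedelta, timezone

    from flask import Flask, request
    from google.cloud import datastore

    app = Flask(__name__)
    client = datastore.Client()

    @app.route("/push", methods=["POST"])
    def handle_push():
        envelope = request.get_json()
        key = datastore.Key.from_legacy_urlsafe(
            base64.b64decode(envelope["message"]["data"]).decode()
        )
        entity = client.get(key)
        send_notification(entity)  # placeholder for the actual delivery logic

        entity["trigger_timestamp"] = datetime.now(timezone.utc) + timedelta(days=30)
        client.put(entity)
        return ("", 204)  # a 2xx response acknowledges the Pub/Sub message

    def send_notification(entity):
        print("sending notification for", entity.key)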

Here, the process that sends the notifications can be massively parallel (use Cloud Functions or Cloud Run -> I recommend the latter because it handles concurrent requests on the same instance and therefore costs less).

Be careful: your scheduler must not trigger your process too often. (Avoid every minute; prefer every 10 minutes or more to avoid duplicate notifications.)

guillaume blaquiere
  • My hotspot concern is not at the time of doing the scheduling, where I can easily control the rate. It is the creation of the transactions that eventually need the reminders. Say I have a spike in transaction creation; then the timestamp index is going to cause problems. Elsewhere it was mentioned that even this is not a problem as long as it's not more than 500/sec. I am trying to put a design in place where I don't have to worry about these limitations. – Siva Sep 28 '19 at 19:27
  • Understood. By the way, as I already said, your design is good. In case of a spike, you have to partition your timestamp index to avoid hotspotting. However, when you query all your deadlines, you then have to scan all the partitions. Here again, the partition scan can be done in parallel thanks to Pub/Sub, push subscriptions and functions/Cloud Run. Keep the list of partitions you have in a side collection. Sadly, there is no simple solution for very high request rates. – guillaume blaquiere Sep 28 '19 at 20:46
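
For completeness, a sketch of what that partitioning could look like (the shard count and names are arbitrary; the equality-plus-inequality query needs a composite index on shard and deadline):

    # Write a small random shard number next to the deadline so index writes spread across
    # ranges, then fan out one query per shard (e.g. one Pub/Sub message per shard) and run
    # them in parallel.
    import random
    from datetime import datetime, timezone

    from google.cloud import datastore

    NUM_SHARDS = 16
    client = datastore.Client()

    def save_with_shard(entity, deadline):
        entity["deadline"] = deadline
        entity["shard"] = random.randrange(NUM_SHARDS)
        client.put(entity)

    def due_in_shard(shard):
        query = client.query(kind="Notification")
        query.add_filter("shard", "=", shard)
        query.add_filter("deadline", "<=", datetime.now(timezone.utc))
        return list(query.fetch())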