
We have a table (kind) with 100M rows (entities) in Google Cloud Datastore. What is the most efficient way to look up the existence of a large number of keys (500K-1M)?

For context, a use case could be that we have a big content datastore (think of all webpages in a domain). This datastore contains pre-crawled content and metadata for each document. Each document, however, could be liked by many users. Now when we have a new user and he/she says he/she likes documents {a1, a2, ..., an}, we want to tell whether all these documents ak (k in 1 to n) have already been crawled. That's the reason we want to do the lookup mentioned above. If there is a subset of documents that we don't have yet, we would start to crawl them immediately. Yes, the ultimate goal is to retrieve all these documents' content and use it to build the user profile.

My current thought is to issue a bunch of batch lookup requests. Each lookup request can contain up to 1K keys [1]. However, to check the existence of every key in a set of 1M, I would still need to issue 1,000 requests.
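For reference, a minimal sketch of that batching approach with the Python client (the `Document` kind and the way keys are derived are just assumptions for illustration):

```python
from google.cloud import datastore

client = datastore.Client()

def find_missing_keys(keys, batch_size=1000):
    """Check existence of many keys via batched lookups.

    Returns the subset of `keys` that do not exist in Datastore.
    """
    missing = []
    for i in range(0, len(keys), batch_size):
        chunk = keys[i:i + batch_size]
        # get_multi returns only the entities that exist; keys it
        # could not find are simply absent from the result.
        found = {entity.key for entity in client.get_multi(chunk)}
        missing.extend(k for k in chunk if k not in found)
    return missing

# Example: keys derived from document identifiers (illustrative).
doc_ids = ["a1", "a2", "a3"]
keys = [client.key("Document", doc_id) for doc_id in doc_ids]
to_crawl = find_missing_keys(keys)
```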

An alternative is to use a customized middle layer that provides a quick lookup (for example, using a Bloom filter or something similar) to save the round trips of multiple requests. Assuming we never delete keys, every time we insert a key we also add it through the middle layer, so the Bloom filter keeps track of which keys we have (with a tolerable false-positive rate). Since this is a custom layer, we could provide a micro-service without such a limit; say, one that could respond to a request asking about the existence of 1M keys. However, this definitely increases our design/implementation complexity.
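To make the idea concrete, a minimal in-memory sketch of such a Bloom-filter layer might look like this (bit-array size, hash count, and the hashing scheme are all illustrative; a real service would persist the filter and tune the false-positive rate):

```python
import hashlib

class BloomFilter:
    """Tiny in-memory Bloom filter for key-existence hints.

    False positives are possible; false negatives are not,
    as long as keys are never deleted.
    """

    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        # Derive num_hashes independent bit positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Usage: add keys on insert, then filter a large candidate list locally.
bf = BloomFilter()
bf.add("https://example.com/page1")
candidates = ["https://example.com/page1", "https://example.com/page2"]
probably_present = [u for u in candidates if bf.might_contain(u)]
```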

Is there any more efficient way to do this? Maybe a better design? Thanks!

[1] https://cloud.google.com/datastore/docs/concepts/limits

greeness
  • I think you're already there with the batches. For the middle layer - where/how would you store the data to use with it? Just curious (can't think of the use case) - how do you obtain that huge list of keys to look up? – Dan Cornilescu Apr 16 '18 at 17:40
  • Thanks, Dan, for the comment! The plan could be like this (assume we never delete keys): every time we insert a key, we add it through the middle layer too, so the Bloom filter in the middle layer has knowledge of what keys we have (with a tolerable false positive rate). Since this is a custom layer, we could provide a micro-service without a limit. Say we could respond to a request asking for the existence of 1M keys. Yes, I know this is adding many complications too. – greeness Apr 16 '18 at 17:50
  • I mean to make these requests asking for key existence - how would the key be obtained? Built following some algorithm? And once the key is determined to exist - would the respective entities be retrieved as well? I guess I'm still trying to wrap my mind around the use case :) – Dan Cornilescu Apr 16 '18 at 18:18
  • I see. I've added the context to the question. :D – greeness Apr 16 '18 at 18:28

1 Answer


I'd suggest breaking the problem down into a more scalable (and less costly) approach.

In the use case you mentioned you can deal with one document at a time, each document having a corresponding entity in the datastore. The webpage URL uniquely identifies the page, so you can use it to generate a unique key/identifier for the respective entity. With a single key lookup (strongly consistent) you can then determine if the entity exists or not, i.e. if the webpage has already been considered for crawling. If it hasn't, a new entity is created and a crawling job is launched for it.
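A minimal sketch of that check, assuming a `Document` kind keyed by a hash of the URL (the kind name and the `launch_crawl_job` hook are illustrative, not part of any existing API):

```python
import hashlib

from google.cloud import datastore

client = datastore.Client()

def launch_crawl_job(url: str):
    """Placeholder for whatever queues the actual crawl (illustrative)."""
    print(f"queueing crawl for {url}")

def ensure_document(url: str) -> datastore.Entity:
    """Create the document entity and schedule a crawl if it doesn't exist."""
    doc_id = hashlib.sha256(url.encode()).hexdigest()  # key derived from the URL
    key = client.key("Document", doc_id)

    entity = client.get(key)  # strongly consistent lookup by key
    if entity is None:
        entity = datastore.Entity(key=key)
        entity.update({"url": url, "crawled": False})
        client.put(entity)
        launch_crawl_job(url)
    return entity
```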

The length of the entity key can be an issue, see How long (max characters) can a datastore entity key_name be? Is it bad to haver very long key_names?. To avoid it you can store the URL as a property of the webpage entity. You'll then have to query for the entity by the url property to determine if the webpage has already been considered for crawling. Such a query is only eventually consistent, meaning that it may take a while from when the document entity is created (and its crawling job launched) until it appears in the query results. Not a big deal, it can be addressed with a bit of logic in the crawling job to prevent and/or remove duplicate documents.
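A sketch of that eventually consistent variant as a keys-only query (kind and property names are, again, assumptions):

```python
from google.cloud import datastore

client = datastore.Client()

def document_exists_by_url(url: str) -> bool:
    """Eventually consistent existence check via a query on the url property."""
    query = client.query(kind="Document")
    query.add_filter("url", "=", url)
    query.keys_only()  # we only care about existence, not the entity data
    return len(list(query.fetch(limit=1))) > 0
```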

I'd keep the "like" information as small entities mapping a document to a user, separated from the document and from the user entities, to prevent the drawbacks of maintaining possibly very long lists in a single entity, see Manage nested list of entities within entities in Google Cloud Datastore and Creating your own activity logging in GAE/P.
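For illustration, such a mapping entity could be as small as one key plus two properties (the kind, property names, and composite key name are just assumptions):

```python
from google.cloud import datastore

client = datastore.Client()

def record_like(user_id: str, doc_id: str):
    """Create one small mapping entity per (user, document) pair.

    A deterministic key name makes the write idempotent: liking the
    same document twice simply overwrites the same tiny entity.
    """
    key = client.key("Like", f"{user_id}:{doc_id}")
    like = datastore.Entity(key=key)
    like.update({"user_id": user_id, "doc_id": doc_id})
    client.put(like)
```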

When a user likes a webpage with a particular URL you just have to check if the matching document entity exists:

  • if it does just create the like mapping entity
  • if it doesn't and you used the above-mentioned unique key identifiers:
    • create the document entity and launch its crawling job
    • create the like mapping entity
  • otherwise:
    • launch the crawling job, which creates the document entity and takes care of deduplication
    • launch a delayed job to create the mapping entity later, when the (unique) document entity becomes available. Possibly chained off the crawling job. Some retry logic may be needed.

Checking if a user liked a particular document becomes a simple query for one such mapping entity (with a bit of care as it's also eventually consistent).
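A sketch of that check, mirroring the query-based approach above (names are illustrative):

```python
from google.cloud import datastore

client = datastore.Client()

def user_liked(user_id: str, doc_id: str) -> bool:
    """Eventually consistent keys-only query for the mapping entity."""
    query = client.query(kind="Like")
    query.add_filter("user_id", "=", user_id)
    query.add_filter("doc_id", "=", doc_id)
    query.keys_only()
    return len(list(query.fetch(limit=1))) > 0
```

Note that if the mapping entity uses a deterministic key name such as `user_id:doc_id` (as in the earlier sketch), a plain `client.get()` on that key would also work and is strongly consistent.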

With such a scheme in place you no longer have to make those massive lookups, you only do one at a time - which is OK; a user liking documents one at a time is IMHO more natural than providing a large list of liked documents.

Dan Cornilescu
  • Thanks for the answer, Dan. I understand your points, but I have one comment. My use case does need to process all the user's likes in a batch (at least the first time). Think about a social network (Facebook) app example. When a new user registers for an app, though he is a new user for the app, he probably already has thousands of likes in his profile. We do want to build a profile for him as fast as possible and present it to him promptly, to impress him with something customized for him. That's why it is important to do this batch lookup. – greeness Apr 17 '18 at 19:49
  • OK, I think I get the idea. Well, you could have the pages already crawled in advance based on the outer profile info, so at app signup you'd know that most (if not all) of them are already there and you'd just have to present them. – Dan Cornilescu Apr 18 '18 at 14:48
  • Or you could quickly prepare just a few of them to present immediately and, while the user checks those out, continue adding more in the background. – Dan Cornilescu Apr 18 '18 at 14:52
  • I will go ahead with my batch query approach to solve the warm-start problem. Thanks for the links in your answer, which are all relevant and helpful! – greeness Apr 20 '18 at 00:05