
I have growing data in GCS and will run a batch job, say every day, to process an increment of about 1 million articles. For each key I need to fetch additional information from Bigtable (which contains billions of records). Is it feasible to do a lookup for every item inside a map operation? Does it make sense to batch those lookups and perform something like a bulk read? Or what is the best approach for this use case with Scio/Beam?

I found in "Pattern: Streaming mode large lookup tables" that performing a lookup on every request is the recommended approach for streaming; however, I'm not sure whether the batch job would overload Bigtable.

Do you have any general or concrete recommendations for handling this use case?

Adam Horky

2 Answers


I've helped others do this before, but in base Dataflow / Beam. You'll need to aggregate the keys into batches for optimal performance. Somewhere between 25 and 100 keys per batch would make sense. If you can, pre-sort the list so that each request's lookups are more likely to hit fewer Cloud Bigtable nodes.
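To make the batching idea concrete, here is a minimal sketch in plain Python. It only illustrates the sort-then-chunk step; `sorted_batches`, `enrich`, `fake_bulk_read`, and `FAKE_TABLE` are hypothetical names invented for this example, and the in-memory dictionary merely stands in for a real bulk read against Cloud Bigtable.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def sorted_batches(keys: Iterable[str], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield de-duplicated keys in sorted order, in chunks of at most batch_size."""
    it = iter(sorted(set(keys)))
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def enrich(
    keys: Iterable[str],
    bulk_read: Callable[[List[str]], Dict[str, dict]],
    batch_size: int = 50,
) -> Dict[str, dict]:
    """Issue one bulk read per batch instead of one lookup per key."""
    result: Dict[str, dict] = {}
    for batch in sorted_batches(keys, batch_size):
        result.update(bulk_read(batch))
    return result


# Hypothetical stand-in for the lookup table; NOT a real Bigtable call.
FAKE_TABLE = {f"article#{i:04d}": {"title": f"Title {i}"} for i in range(1000)}


def fake_bulk_read(batch: List[str]) -> Dict[str, dict]:
    """Return the rows found for this batch; missing keys are simply skipped."""
    return {k: FAKE_TABLE[k] for k in batch if k in FAKE_TABLE}
```

Because the keys are sorted before chunking, each batch covers a contiguous key range, which is what makes it more likely that a single request touches fewer nodes.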

You can use the Cloud Bigtable client directly; just make sure to use the "use bulk" setting, or have a singleton to cache the client.
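The singleton idea could look roughly like the following plain-Python sketch. `FakeBigtableClient` and `get_client` are hypothetical names used only for illustration; a real pipeline would construct the actual Cloud Bigtable client in the factory instead.

```python
import threading
from typing import Callable, Dict, List, Optional


class FakeBigtableClient:
    """Hypothetical stand-in for an expensive-to-create client."""

    instances_created = 0

    def __init__(self) -> None:
        FakeBigtableClient.instances_created += 1

    def read_rows(self, keys: List[str]) -> Dict[str, dict]:
        return {k: {"key": k} for k in keys}


_client: Optional[FakeBigtableClient] = None
_client_lock = threading.Lock()


def get_client(
    factory: Callable[[], FakeBigtableClient] = FakeBigtableClient,
) -> FakeBigtableClient:
    """Create the client once per process and reuse it on later calls."""
    global _client
    if _client is None:
        with _client_lock:
            # Re-check inside the lock so concurrent callers create only one client.
            if _client is None:
                _client = factory()
    return _client
```

In a Beam pipeline, a helper like this would typically be called from a DoFn's setup hook, so that every bundle processed by the same worker process shares one connection.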

This will definitely have an impact on your Cloud Bigtable cluster, but I couldn't tell you how much. You may need to increase the size of your cluster so that other uses of Cloud Bigtable don't suffer.

Solomon Duskis
  • Thanks a lot @SolomonDuskis. I suppose Bloom filters might also help avoid certain lookups, but I can't find how to enable them in Bigtable (as opposed to HBase). Is it possible, or does it already work out of the box? – Adam Horky Mar 27 '19 at 08:36
  • There currently isn't a way to configure Bloom filters. It might be worth asking about it in the "google-cloud-bigtable-discuss" group that you can find here: https://cloud.google.com/bigtable/docs/support/getting-support – Solomon Duskis Mar 27 '19 at 15:28
  • Yep, I already found a similar question there: https://groups.google.com/forum/#!activity/google-cloud-bigtable-discuss/1_3VX374DgAJ/google-cloud-bigtable-discuss/fPW4678Yq9c/1_3VX374DgAJ – Adam Horky Mar 28 '19 at 08:31

We have a helper, BigtableDoFn, in Scio. It doesn't batch, but it at least abstracts away the async request handling in a DoFn so that the processElement/map function is not blocked by the network round trip.
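As a rough illustration of why non-blocking requests help, here is a generic plain-Python sketch that keeps several lookups in flight at once. It is not how BigtableDoFn is implemented; `slow_lookup`, `lookup_all`, and `max_in_flight` are hypothetical names, and the sleep only simulates network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def slow_lookup(key: str) -> Tuple[str, dict]:
    """Hypothetical lookup that simulates one network round trip."""
    time.sleep(0.01)
    return key, {"length": len(key)}


def lookup_all(
    keys: Iterable[str],
    lookup: Callable[[str], Tuple[str, dict]],
    max_in_flight: int = 8,
) -> Dict[str, dict]:
    """Run lookups concurrently, with at most max_in_flight requests at a time."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = [pool.submit(lookup, key) for key in keys]
        # Collect results as they complete rather than in submission order.
        return dict(future.result() for future in as_completed(futures))
```

Capping the number of in-flight requests is one simple way to limit the load a single worker puts on the cluster.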

Neville Li