
I have a requirement to load about 10 million records from BigQuery into Firestore every day. What is the fastest way to do this?

A Cloud Function with parallel individual writes is an option (according to the link below), but in that case parallelizing the reads from the BigQuery table would be a challenge.

What is the fastest way to write a lot of documents to Firestore?

Would Dataflow work in this scenario, i.e. reading and writing the data through Dataflow?

Akhil

1 Answer


Dataflow works in this case. It lets you parallelize how you read data from BigQuery and write it into Firestore.

There is work in progress to add a Firestore sink to Beam. It should be available for the Java SDK in Beam 2.31.0; see https://github.com/apache/beam/pull/14261

In the meantime, you may be able to roll your own. In Python it would look something like this:

# assuming: import apache_beam as beam
(p
 | beam.io.ReadFromBigQuery(...)
 | beam.WithKeys(lambda row: hash(str(row)) % 100)  # GroupIntoBatches needs a keyed PCollection
 | beam.GroupIntoBatches(50)  # Batches of 50-500 elements will help with throughput
 | beam.Values()              # drop the keys, pass each batch downstream
 | beam.ParDo(WriteToFirestoreDoFn(project, collection)))

Where you write your own WriteToFirestoreDoFn, which (taking, say, the project and a target collection name) does something like this:

from google.cloud import firestore

class WriteToFirestoreDoFn(beam.DoFn):
  def __init__(self, project, collection):
    self.client = None
    self.project = project        # GCP project that owns the Firestore database
    self.collection = collection  # target collection to write into

  def process(self, rows):
    # Create the client lazily on the worker; it cannot be pickled with the DoFn.
    if not self.client:
      self.client = firestore.Client(project=self.project)
    batch = self.client.batch()
    for row in rows:  # each row is a dict coming out of ReadFromBigQuery
      batch.set(self.client.collection(self.collection).document(), row)
    batch.commit()

This is still a little pseudocode-y, but it should help you get started with what you want.
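
If it helps, a minimal sketch of wiring this together and running it on Dataflow could look like the following; the project, region, bucket, table, and collection names here are placeholders, not anything specific to your setup:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder Dataflow settings; substitute your own project, region, and bucket.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp')

with beam.Pipeline(options=options) as p:
  (p
   | beam.io.ReadFromBigQuery(table='my-project:my_dataset.my_table')
   | beam.WithKeys(lambda row: hash(str(row)) % 100)
   | beam.GroupIntoBatches(50)
   | beam.Values()
   | beam.ParDo(WriteToFirestoreDoFn('my-project', 'my_collection')))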

Pablo
  • Great, any performance numbers to share now? And what would the best batch size be? According to the Firestore docs, a maximum of 10K writes per second can be made against the database – Akhil Apr 10 '21 at 04:43
  • I am not sure, as I don't know Firestore well. I have used 500-record inserts before, but it may vary. I recommend you test it out. I suppose that 150 records may be reasonable and would allow you to reach high parallelism for your transform – Pablo Apr 12 '21 at 22:35