
I have a Cloud Function in Python 3.7 that writes/updates small documents in Firestore. Each document has a user_id as its document ID, and two fields: a timestamp and a map (a dictionary) with three key-value pairs, all of them very small.

This is the code I'm using to write/update Firestore:

    from datetime import datetime
    from google.cloud import firestore

    db = firestore.Client()
    doc_ref = db.collection(u'my_collection').document(user['user_id'])
    date_last_seen = datetime.combine(date_last_seen, datetime.min.time())
    doc_ref.set({u'map_field': map_value, u'date_last_seen': date_last_seen})

My goal is to call this function once a day and write/update ~500K documents. I ran the following tests, listing the execution time for each:

Test A: Process the output for 1,000 documents; don't write/update Firestore -> ~2 seconds

Test B: Process the output for 1,000 documents; write/update Firestore -> ~1 min 3 seconds

Test C: Process the output for 5,000 documents; don't write/update Firestore -> ~3 seconds

Test D: Process the output for 5,000 documents; write/update Firestore -> ~3 min 12 seconds

My conclusion here: writing/updating Firestore is consuming more than 99% of my compute time.

Question: How can I write/update ~500K documents every day efficiently?

– alek6dj

2 Answers


It's not possible to prescribe a single course of action without knowing details about the data you're actually trying to write. I strongly suggest you read the documentation about best practices for Firestore; it will give you a sense of what you can do to avoid problems with heavy write loads.

Basically, you will want to avoid these situations, as described in that doc:

High read, write, and delete rates to a narrow document range

Avoid high read or write rates to lexicographically close documents, or your application will experience contention errors. This issue is known as hotspotting, and your application can experience hotspotting if it does any of the following:

  • Creates new documents at a very high rate and allocates its own monotonically increasing IDs.

    Note: Cloud Firestore allocates document IDs using a scatter algorithm. You should not encounter hotspotting on writes if you create new documents using automatic document IDs.

  • Creates new documents at a high rate in a collection with few documents.

  • Creates new documents with a monotonically increasing field, like a timestamp, at a very high rate.

  • Deletes documents in a collection at a high rate.

  • Writes to the database at a very high rate without gradually increasing traffic.

I won't repeat all the advice in that doc. What you do need to know is this: because Firestore is built to scale massively, it places limits on how quickly you can write data into it. The requirement to ramp up traffic gradually is probably the one constraint you can't work around.

– Doug Stevenson
  • Suppose I follow the "ramping up" recommendation, scaling the number of writes every 5 minutes using the formula previous_writes + previous_writes * 0.5 each time. Suppose that after 2 hours, Firebase is ready to scale to 500K writes. What is expected to happen tomorrow when my function tries to write/update 500K documents again? Will Firestore be fast? – alek6dj Apr 21 '20 at 07:56
  • Only if you sustain that write rate. – Doug Stevenson Apr 21 '20 at 16:20
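
For context on that exchange: the 50% growth step in the comment matches Firestore's documented "500/50/5" rule (start around 500 writes per second to a new collection, then increase traffic by 50% every 5 minutes). Here is a minimal sketch, mine rather than from either participant, that projects how long such a ramp-up takes to reach a target sustained write rate:

    def minutes_to_reach(target_rate, start_rate=500, growth=0.5, interval_min=5):
        """Minutes of ramp-up before the sustained write rate reaches target_rate."""
        rate, minutes = start_rate, 0
        while rate < target_rate:
            rate += rate * growth  # previous_writes + previous_writes * 0.5
            minutes += interval_min
        return minutes

    print(minutes_to_reach(2000))  # -> 20  (500 -> 750 -> 1125 -> 1687.5 -> 2531.25)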

I met my needs with batched writes.
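
For illustration, a batched write with the Python server client library looks roughly like this. This is a sketch, not my exact code: the `users` iterable and its `map_value`/`date_last_seen` keys are assumptions mirroring the question's fields, and a single batch was capped at 500 operations at the time:

    from google.cloud import firestore

    db = firestore.Client()

    BATCH_LIMIT = 500  # a single batched write was capped at 500 operations

    def write_users(users):
        """Write/update one document per user, committing every 500 operations."""
        batch = db.batch()
        pending = 0
        for user in users:
            doc_ref = db.collection(u'my_collection').document(user['user_id'])
            batch.set(doc_ref, {u'map_field': user['map_value'],
                                u'date_last_seen': user['date_last_seen']})
            pending += 1
            if pending == BATCH_LIMIT:
                batch.commit()      # flush a full batch
                batch = db.batch()  # start a fresh one
                pending = 0
        if pending:
            batch.commit()          # commit the final partial batch

But according to the Firestore documentation there is another, faster way: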

Note: For bulk data entry, use a server client library with parallelized individual writes. Batched writes perform better than serialized writes but not better than parallel writes. You should use a server client library for bulk data operations and not a mobile/web SDK.
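
On that note, here is a hedged sketch of what "parallelized individual writes" could look like in Python with a plain thread pool; the worker count, helper names, and data layout are illustrative, again mirroring the question's fields:

    from concurrent.futures import ThreadPoolExecutor

    from google.cloud import firestore

    db = firestore.Client()

    def write_one(user):
        # One independent write per document, rather than a serialized batch commit
        doc_ref = db.collection(u'my_collection').document(user['user_id'])
        doc_ref.set({u'map_field': user['map_value'],
                     u'date_last_seen': user['date_last_seen']})

    def write_parallel(users, workers=32):
        # Threads suffice here: each call spends its time waiting on network I/O
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(write_one, users))  # consume the iterator to surface exceptions

Because each write is an independent set(), Firestore can service them concurrently instead of waiting on a single batch commit. (Newer releases of the Python server client also ship a BulkWriter helper that manages this throttling and parallelism for you, if your client version includes it.)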

I also recommend taking a look at this post on Stack Overflow, which includes examples in Node.js.

– alek6dj