
Assuming a collection size of 100k+ docs, what is the most effective way to update all documents?

We are talking about a database-wide update process: I need to read each doc, add/remove some fields depending on specific rules, then write the updated fields back to the doc, and repeat this process for all docs in the collection.

And ideally I would like to be able to do it in the cloud. Using my local computer is already proving to be problematic in terms of how much time it takes and the increasing memory load.

Cloud functions don't seem to be a great option either, due to both the memory constraint (2 GB) and the 9-minute timeout limit, even when using batched writes...

  • Have you taken a look at [this answer](https://stackoverflow.com/a/56145214/5387454)? It addresses a similar issue and complements what Edmund commented – Louis C Jun 18 '20 at 23:39
  • @Edgardo yes I have, but the issue is not similar: my issue is not related to the batch writing aspect (as in that answer), but to how this operation can be managed in the cloud. – aragalie Jun 19 '20 at 06:07

1 Answer


This can definitely be done with cloud functions.

I don't think you should be hitting memory constraints unless you're trying to query the entire collection at once. Use paginated querying to cap the number of documents read at once and then loop through the pages instead.
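As a minimal sketch of that paginated loop, assuming the Node.js Admin SDK (the collection name, page size, and `computeUpdatedFields()` rule function below are placeholders, not part of the original answer):

```js
const admin = require("firebase-admin");
admin.initializeApp();
const db = admin.firestore();

async function updateAllDocs() {
  const pageSize = 500; // also the maximum number of writes per batch
  let lastDoc = null;

  while (true) {
    // Page through the collection ordered by document id.
    let query = db.collection("myCollection")
      .orderBy(admin.firestore.FieldPath.documentId())
      .limit(pageSize);
    if (lastDoc) query = query.startAfter(lastDoc);

    const snapshot = await query.get();
    if (snapshot.empty) break; // no more pages

    // Apply your add/remove-field rules and write the page back in one batch.
    const batch = db.batch();
    snapshot.docs.forEach((doc) => {
      batch.update(doc.ref, computeUpdatedFields(doc.data())); // placeholder for your rules
    });
    await batch.commit();

    lastDoc = snapshot.docs[snapshot.docs.length - 1];
  }
}
```

Only one page of documents is held in memory at a time, which is what keeps the memory footprint flat.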

Option 1

One quick way to get around the timeout limitation is to use a pub/sub-triggered cloud function. When your function is about to time out, have it publish to its own pub/sub topic to trigger itself to run again. Just make sure it stops publishing once there are no more documents to update, otherwise it will get stuck in an infinite loop.
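A rough sketch of that self-retriggering pattern, assuming Firebase Functions v1 syntax and the `@google-cloud/pubsub` client; the topic name and the `processNextPage()` helper are illustrative placeholders:

```js
const functions = require("firebase-functions");
const { PubSub } = require("@google-cloud/pubsub");
const pubsub = new PubSub();

exports.updateDocs = functions
  .runWith({ timeoutSeconds: 540, memory: "2GB" })
  .pubsub.topic("update-docs")
  .onPublish(async (message) => {
    // Assumes the message body is JSON, e.g. { "cursor": "<last uid processed>" }.
    const { cursor } = message.json || {};
    const deadline = Date.now() + 8 * 60 * 1000; // stop ~1 minute before the 9-minute limit

    let nextCursor = cursor;
    let done = false;
    while (!done && Date.now() < deadline) {
      // processNextPage() is a placeholder: update one page and return the new cursor.
      ({ nextCursor, done } = await processNextPage(nextCursor));
    }

    if (!done) {
      // Re-trigger ourselves so the next invocation resumes where this one left off.
      await pubsub
        .topic("update-docs")
        .publishMessage({ json: { cursor: nextCursor } });
    }
  });
```

To kick the process off, publish an initial message to the topic, e.g. `gcloud pubsub topics publish update-docs --message '{}'`.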

Option 2

If you need this update task to be performed extremely quickly, you can use a divide-and-conquer strategy that looks like this:

[Fn A] =publish=to=> [Pub/Sub] =trigger=> [Fn B], [Fn B], [Fn B], [Fn B] . . .

Write a cloud function A that queries the collection using paginated querying with pages of size N. For each page, it publishes the uid of the page's first document and the value of N to a pub/sub topic.

Write a cloud function B that is triggered by that pub/sub topic. It reads the document uid and the number N from the message, uses that uid as a starting point, and updates the next N documents. This function will be triggered many times in parallel, once for each publication from Fn A, so your choice of N determines how many instances of function B spawn. A sketch of both functions follows below.
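As a rough sketch of the fan-out, again assuming the Node.js Admin SDK and Firebase Functions v1 (the topic name, collection name, and `applyRules()` function are illustrative placeholders):

```js
const functions = require("firebase-functions");
const admin = require("firebase-admin");
const { PubSub } = require("@google-cloud/pubsub");
admin.initializeApp();
const db = admin.firestore();
const pubsub = new PubSub();
const N = 500; // page size; also controls how many instances of Fn B spawn

// Fn A: walk the collection page by page and publish each page's starting uid.
exports.fanOut = functions.https.onRequest(async (req, res) => {
  let lastDoc = null;
  while (true) {
    let query = db.collection("myCollection")
      .orderBy(admin.firestore.FieldPath.documentId())
      .select() // ids only, keeps Fn A light
      .limit(N);
    if (lastDoc) query = query.startAfter(lastDoc);

    const snap = await query.get();
    if (snap.empty) break;

    await pubsub
      .topic("update-chunk")
      .publishMessage({ json: { startUid: snap.docs[0].id, count: N } });
    lastDoc = snap.docs[snap.docs.length - 1];
  }
  res.sendStatus(200);
});

// Fn B: update the N documents starting at the published uid.
exports.updateChunk = functions.pubsub
  .topic("update-chunk")
  .onPublish(async (message) => {
    const { startUid, count } = message.json;
    const snap = await db.collection("myCollection")
      .orderBy(admin.firestore.FieldPath.documentId())
      .startAt(startUid)
      .limit(count)
      .get();

    const batch = db.batch();
    snap.docs.forEach((doc) => {
      batch.update(doc.ref, applyRules(doc.data())); // placeholder for your rules
    });
    await batch.commit();
  });
```

Keeping N at or below 500 lets each Fn B instance commit its page in a single batch.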

  • Appreciate the answer @Edmund. For Option 1, how would you propose to let the new instance of the function know where the old instance left off, assuming i'm using paginated querying? And what would happen for the cases where it's in the middle of a batch? – aragalie Jun 18 '20 at 18:30
  • I would write the query such that it only returns documents that haven't yet been updated. That way your function can always just start at the beginning of the list of docs returned by the query. This can be done by adding the [where()](https://firebase.google.com/docs/firestore/query-data/queries) function to your query. For example: `query = db.collection("cities").where("population", ">", 100000).orderBy("population").limit(5000);` – Edmund Rosewright Jun 18 '20 at 23:20
  • Thanks for the clarification. This will not work in my case though, as I would only know if a doc needs to be updated once I retrieve and perform the calculations on it; in this case `where()` doesn't help. So I believe only Option 2 will be feasible, and I'll give that a try. – aragalie Jun 19 '20 at 06:09
  • It's ok if you can't use `where()` in Option 1. In that case you can have your cloud function publish the uid of the last document it updated to pub/sub. So then the next invocation of the cloud function will know where the previous invocation left off. – Edmund Rosewright Jun 19 '20 at 07:23