
If you have decided to denormalize/duplicate your data in Firestore to optimize for reads, what patterns (if any) are generally used to keep track of the duplicated data so that they can be updated correctly to avoid inconsistent data?

As an example, if I have a feature like a Pinterest Board where any user on the platform can pin my post to their own board, how would you go about keeping track of the duplicated data in many locations?

What about creating a relational-like table for each unique location where the data can exist, which is used to reconstruct the paths that require updating?

For example, creating a users_posts_boards collection that is, firstly, a collection of userIDs, each with a sub-collection of postIDs, which finally has another sub-collection of boardIDs with a boardOwnerID. Then you use those to reconstruct the paths of the duplicated data for a post (e.g. /users/[boardOwnerID]/boards/[boardID]/posts/[postID])?

Also, if posts can additionally be shared to groups and lists, would you continue to make users_posts_groups and users_posts_lists collections and sub-collections to track duplicated data in the same way?

Alternatively, would you instead have a posts_denormalization_tracker that is just a collection of unique postIDs that includes a sub-collection of locations that the post has been duplicated to?

{
  postID: 'someID',
  locations: ( <---- collection
    "path/to/post/location1",
    "path/to/post/location2",
    ...
  )
}
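
For concreteness, appending one location entry to such a tracker could look like the following Java sketch; the collection names mirror the structure above, while the "path" field and the helper itself are hypothetical:

import com.google.firebase.firestore.FirebaseFirestore;
import java.util.Collections;

public class TrackerWrite {
    // Hypothetical helper: record one more location a post was duplicated to
    public static void addLocation(String postId, String duplicatedPath) {
        FirebaseFirestore.getInstance()
                .collection("posts_denormalization_tracker").document(postId)
                .collection("locations")
                // Each location document just stores the duplicated post's path
                .add(Collections.singletonMap("path", duplicatedPath));
    }
}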

This would mean that basically all writes to Firestore would need to be done through Cloud Functions that can keep track of this data for security reasons...unless Firestore security rules are sufficiently powerful to allow add operations to the /posts_denormalization_tracker/[postID]/locations sub-collection without allowing reads or updates to the sub-collection or the parent postIDs collection.

I'm basically looking for a sane way to track heavily denormalized data.

Edit: oh yeah, another great example would be the post author's profile information being embedded in every post. Imagine the hellscape of trying to keep all of that up-to-date as it is shared across the platform and then a user updates their profile.

Socceroos
  • I think the Functions are essential to have consistency; bringing in listeners on the delete or modify actions of a document's root will let you change it everywhere. I wouldn't use a client-side solution, it would get messy really fast. – niclas_4 Jan 18 '19 at 13:15
  • @Badgy Yeah, it wouldn't even be possible to do client-side as various security rules (think: private boards) would prevent the client from performing an update. Also, you wouldn't want to use Functions to basically do massive searches for possible locations of duplicated data as that would be prohibitively expensive (from both the perspective of a long-running Function and reads to Cloud Firestore). – Socceroos Jan 18 '19 at 13:16

1 Answer


I'm answering this question because of your request from here.

When you are duplicating data, there is one thing that you need to keep in mind: in the same way you add data, you also need to maintain it. In other words, if you want to update/delete an object, you need to do it in every place where it exists.

What patterns (if any) are generally used to keep track of the duplicated data so that they can be updated correctly to avoid inconsistent data?

To keep track of all the operations that we need to perform in order to have consistent data, we add all of them to a batch. You can add one or more update operations on different references, as well as delete or add operations. For that please see:
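
To illustrate, here is a minimal sketch of such a batched fan-out update using the Android client SDK. The "title" field, the IDs, and the board-copy path are assumptions taken from the question, not a definitive implementation:

import com.google.firebase.firestore.DocumentReference;
import com.google.firebase.firestore.FirebaseFirestore;
import com.google.firebase.firestore.WriteBatch;

public class FanOutUpdate {
    // Hypothetical helper: apply the same change to the canonical post and one copy
    public static void updatePostTitle(String postId, String boardOwnerId,
                                       String boardId, String newTitle) {
        FirebaseFirestore db = FirebaseFirestore.getInstance();
        WriteBatch batch = db.batch();

        // The canonical post document
        DocumentReference postRef = db.collection("posts").document(postId);
        // One duplicated copy, at the board path from the question
        DocumentReference boardCopyRef = db.collection("users").document(boardOwnerId)
                .collection("boards").document(boardId)
                .collection("posts").document(postId);

        batch.update(postRef, "title", newTitle);
        batch.update(boardCopyRef, "title", newTitle);

        // All writes in the batch succeed or fail together
        batch.commit();
    }
}

Keep in mind that a single batch is limited to 500 operations, so a post duplicated to more places than that has to be updated across several batches.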

What about creating a relational-like table for each unique location where the data can exist, which is used to reconstruct the paths that require updating?

In my opinion there is no need to add an extra "relational-like table", but if you feel comfortable with it, go ahead and use it.

Then you use those to reconstruct the paths of the duplicated data for a post (e.g. /users/[boardOwnerID]/boards/[boardID]/posts/[postID])?

Yes, you need to pass the corresponding document ID to each document() method in order to make the update operation work. Unfortunately, there are no wildcards in Cloud Firestore paths to documents; you have to identify the documents by their IDs.
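
For example, assuming the tracker from the question stores boardOwnerID and boardID alongside each postID, rebuilding the reference to one duplicated copy could look like this sketch (variable names are hypothetical; imports as in the batch example above):

// Rebuild /users/[boardOwnerID]/boards/[boardID]/posts/[postID] from the tracked IDs
DocumentReference duplicatedPostRef = FirebaseFirestore.getInstance()
        .collection("users").document(boardOwnerId)
        .collection("boards").document(boardId)
        .collection("posts").document(postId);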

Alternatively, would you instead have a posts_denormalization_tracker that is just a collection of unique postIDs that includes a sub-collection of locations that the post has been duplicated to?

I consider that it isn't necessary either, since it requires extra read operations. Since everything in Firestore is about the number of reads and writes, I think you should think again about this approach. Please see Firestore usage and limits.

unless Firestore security rules are sufficiently powerful to allow add operations to the /posts_denormalization_tracker/[postID]/locations sub-collection without allowing reads or updates to the sub-collection or the parent postIDs collection.

Firestore security rules are powerful enough to do that. You can allow or deny reads and writes, and you can even apply separate rules to each CRUD operation you need.
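
For instance, a rules sketch along these lines (using the collection names from the question) would let signed-in clients append new location entries while denying them reads and updates; treat it as an illustration, not a production ruleset:

rules_version = '2';
service cloud.firestore {
  match /databases/{database}/documents {
    // The parent tracker documents stay locked down for clients
    match /posts_denormalization_tracker/{postID} {
      allow read, write: if false;

      // Clients may only add new location entries, never read or change them
      match /locations/{locationID} {
        allow create: if request.auth != null;
        allow read, update, delete: if false;
      }
    }
  }
}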

I'm basically looking for a sane way to track heavily denormalized data.

The simplest way I can think of is to add the operations to a key-value data structure. Let's assume we have a map that looks like this:

import com.google.firebase.firestore.DocumentReference;
import java.util.HashMap;
import java.util.Map;

// Map each updated object to the DocumentReference it should be written to
Map<Object, DocumentReference> map = new HashMap<>();
map.put(customObject1, reference1);
map.put(customObject2, reference2);
map.put(customObject3, reference3);
// And so on

Iterate through the map, add all those keys and values to a batch, commit the batch, and that's it.
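
Concretely, that last step could look like this sketch, reusing the map above; set() overwrites each duplicated document with the fresh object:

WriteBatch batch = FirebaseFirestore.getInstance().batch();
for (Map.Entry<Object, DocumentReference> entry : map.entrySet()) {
    // Write the updated object over the copy stored at this reference
    batch.set(entry.getValue(), entry.getKey());
}
batch.commit();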

Alex Mamo
  • Yes, so I am aware that you can perform batch updates for denormalized data. But how do you keep track of all the references to where that data exists? As an example, if 100 users have `pinned` a `post` to their own boards (both private and public), then before I can update that `post` I would first need to search *every* user's boards to see if they've got the `post` in their board. How do people get around that? – Socceroos Jan 20 '19 at 22:24
  • I usually use an array where I store references to the objects that I need to update in order to have consistent data. Assuming we have a property named `photoUrl` under each user object that holds a reference to an image, when this image is changed, all the corresponding images that exist in other objects should be changed too. The operation is simple: I read the user document, I read the `photoUrl` and the array property, and then I just add the new image URL along with the references and make the update. That's it. You can achieve the same thing with a map. Is it ok now? – Alex Mamo Jan 21 '19 at 09:11
  • Where do you store the hashmap? On the client? That just won't work for references to a private location that the client can't write to. Who updates the hashmap with new locations of an object's data? The user who owns the data won't know when some other user pins their post to a private board. How do you track that? I'm not sure you're answering my question. Perhaps I've framed my question badly. – Socceroos Jan 23 '19 at 02:23
  • I think you misunderstood my answer. That HashMap that I was talking about is only for the references of a particular user. If you need to update references of multiple users then you should store that information in the database. – Alex Mamo Jan 23 '19 at 08:28
  • @Socceroos How did you keep track of all the references to where data exists? I have a tasks collection whose documents have an experts field containing duplicated user data (an object of objects), and I need to get a reference to all the duplicated expert data in each task to apply new changes ... how can I do that without many reads and writes? – Ahmed Saeed Jul 03 '20 at 03:23
  • If I make a reference collectionRef = db.collection('tasks').where(`experts/${uid}/id`, '==', `uid`), how do I use this reference to update the duplicated data that collectionRef refers to ... the documentation isn't clear! – Ahmed Saeed Jul 03 '20 at 05:27
  • @AhmedSaeed That reference is actually a Query. If you want to update all elements that are a part of the result set, check **[this](https://stackoverflow.com/questions/52480575/cloud-firestore-update/52481163)** out. – Alex Mamo Jul 03 '20 at 08:57
  • @AlexMamo Can you imagine how it would be if I had 1000 duplicated documents? Your solution would cost me 1000 reads and then 1000 writes, so how does duplicating data benefit me if it costs that much? – Ahmed Saeed Jul 03 '20 at 13:11
  • @AhmedSaeed That's the price you pay when you are using denormalization. – Alex Mamo Jul 03 '20 at 13:18
  • @Ahmed Saeed Hi guys, I'm in the same situation right now. I found this article on "multi-path updates"; maybe it can help: https://medium.com/@danbroadbent/firebase-multi-path-updates-updating-denormalized-data-in-multiple-locations-b433565fd8a5 – Steffi Jan 18 '22 at 14:29
  • Update: The link is related to Firebase Realtime Database, not Firestore. – Steffi Jul 25 '22 at 14:39