Accessing parent data for each DocumentReference in collection group query

Question

I have a collection of companies and inside of each company document, I have a collection of appointments. I want to loop through all appointments of all companies in a cloud function, so I am using the following collection group query:

db.collectionGroup('appointments')
    .get()
    .then((querySnapshot: any) => {
        querySnapshot.forEach((appointmentDoc: any) => {
            const appointment: Appointment = appointmentDoc.data();
            appointmentDoc.ref.parent.parent.get().then((companyDoc: any) => {
                const company: Company = companyDoc.data();
                ...
            });
        });
     });

As you can see, in each iteration, I am also getting data for the company that the appointment came from. This works, but I'm concerned about performance. If I have 500 appointments, then isn't this method basically making 501 calls to the database (1 for the appointments and then getting the company data for all 500 appointments)? Is there a better way I can access that parent data so I'm not making all those extra calls? Would be great if I can do this in a way that scales.

can't you get all the companies and `.then` all appointments and then reduce the datasets by some foreign key between appointment and companies? — argentum47, Mar 24 '20 at 03:57
I did initially make a call for companies first and looped through all appointments of each company, but was trying a different approach here with the collection group query as I thought I might be able to do things more efficiently. It's looking like that probably isn't the case. — Bryan, Mar 24 '20 at 04:36

score 1 · Answer 1 · answered Mar 24 '20 at 03:55

There is no way to get the parent documents at the same time as the documents from the appointments collection.

The only thing you can do is gather the parent document IDs into batches of 10 and then run an IN query for each batch. But I doubt it's worth the effort, because the wire traffic is likely pretty much the same.

Note, though, that performance does not usually correlate linearly with the number of calls, so test before trying to optimize it. Also see "Google Firestore - how to get document by multiple ids in one round trip?".
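As a rough illustration of the batching idea, here is a minimal sketch. The helper names `chunk` and `fetchInBatches` are hypothetical, and the actual IN query is passed in as a callback so the batching logic stays independent of the SDK. The default batch size of 10 is simply the limit quoted above; check the limit that applies to your SDK version.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical helper: `fetchBatch` stands in for one IN query, e.g.
//   db.collection('companies')
//     .where(admin.firestore.FieldPath.documentId(), 'in', ids)
//     .get()
async function fetchInBatches<T>(
  ids: readonly string[],
  fetchBatch: (batch: string[]) => Promise<T[]>,
  batchSize = 10 // assumed IN limit, taken from the answer above
): Promise<T[]> {
  // De-duplicate first so each parent document is requested only once.
  const uniqueIds = Array.from(new Set(ids));
  const results = await Promise.all(
    chunk(uniqueIds, batchSize).map((batch) => fetchBatch(batch))
  );
  // Flatten the per-batch results into one list.
  return ([] as T[]).concat(...results);
}
```

With 500 appointments spread over 23 companies, this would issue 3 queries for company data instead of 500 individual reads.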

Also: do consider why you need 500 documents at once. You'll typically want to load a screenful of data, and this seems a lot more. For general hints about data modeling in Firestore, I recommend the first bunch of episodes of Getting to know Cloud Firestore.

Frank van Puffelen

In my actual code I have a filter so that it only shows appointments that are within 1 week of the given day to keep the length of the array down some. I'm getting so many at once because I'm running a cron job every minute to check if there are any appointment reminders that need to be sent. I'm guessing there is a more efficient way to do it and that's ultimately what I'm trying to figure out. – Bryan Mar 24 '20 at 04:06
So hypothetically, looping over 500 companies, `.then` looping over 100 appointments for each company wouldn't necessarily be a crazy slow operation? I was picturing that as being equivalent to 50,000 API calls, which would be insane. – Bryan Mar 24 '20 at 04:41
If you need to have a list of reminders, I'd store a list of reminders in the database. So store the (additional) data in a format that makes the query possible in a not-insane way. – Frank van Puffelen Mar 24 '20 at 13:44

score 1 · Answer 2 · answered Mar 24 '20 at 04:06

Firestore doesn't actually bill you based on number of queries. It's based on number of document reads. So, if you have 500 appointments, your code is going to read 1000 documents, since it's reading a company document once for each appointment document.

What you can do instead is read each company document only once, rather than once for each appointment of that company. You can maintain a cache in memory for that, using something like this:

// cache of companies identified by their document ID
const companies: { [key: string]: Company } = {}

db.collectionGroup('appointments')
    .get()
    .then((querySnapshot: any) => {
        querySnapshot.forEach((appointmentDoc: any) => {
            const appointment: Appointment = appointmentDoc.data();
            const parentRef = appointmentDoc.ref.parent.parent
            const companyId = parentRef.id
            let company: Company
            if (companies[companyId]) {
                company = companies[companyId]
                // work with cached company here
            }
            else {
                parentRef.get().then((companyDoc: any) => {
                    company = companyDoc.data();
                    companies[companyId] = company
                    // work with queried company here
                });
            }
        });
     });

This is incomplete, though: the inner get() is still asynchronous, so the loop keeps requesting company documents as fast as the appointment iterator runs, before any result has reached the cache. You will have to serialize the inner reads somehow, or group the appointments by company ID and iterate the groups, so that you don't fetch a company document more than once.

But I hope you get the idea here that using a memory cache can save you document reads.
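One possible way to close the gap left by the asynchronous read is to cache the promise of the read rather than its result. The sketch below is not part of the answer's code: `getCompany`, `companyReads`, and the small `CompanyRef`/`CompanySnap` interfaces are hypothetical stand-ins for the Firestore types, and the `Company` fields are assumed.

```typescript
interface Company { name: string } // assumed shape

// Minimal structural stand-ins for the Firestore snapshot and reference.
interface CompanySnap { data(): unknown }
interface CompanyRef { id: string; get(): Promise<CompanySnap> }

// Cache of in-flight or completed company reads, keyed by document ID.
const companyReads = new Map<string, Promise<Company>>();

function getCompany(ref: CompanyRef): Promise<Company> {
  let pending = companyReads.get(ref.id);
  if (!pending) {
    // Store the promise immediately, before it resolves, so that other
    // appointments of the same company reuse this read.
    pending = ref.get().then((snap) => snap.data() as Company);
    companyReads.set(ref.id, pending);
  }
  return pending;
}

// Usage inside the appointment loop:
//   const company = await getCompany(appointmentDoc.ref.parent.parent);
```

Because the promise is stored synchronously, a second appointment of the same company finds it even while the first read is still pending. Note that a failed read would stay cached as a rejected promise, so a real implementation may want to evict the entry on error.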

Sounds like it would be better for me to not do a collection group query, and instead begin by making a call for all companies, then loop over each one and get that company's appointments to loop through? I started using the collection group query because I thought I'd be able to access the parent (company) data more easily. — Bryan, Mar 24 '20 at 04:32
Whatever you think is easiest. Just avoid doing unnecessary reads. — Doug Stevenson, Mar 24 '20 at 04:34

LeadDreamer · Answer 3 · 2020-03-27T17:51:28.237

I use a very hierarchical structure, which would look like it would have similar problems, BUT...

...with a NoSQL database like Firestore, you have to DROP the SQL mantra of DRY. If the data is static (for example, whatever "company" data you actually need for an appointment), you absolutely can and should COPY THAT DATA.

For example, you could quite trivially add to the appointment document the structure:

interface Appointment {
  // .... (other appointment fields)
  company: {
    id: string;
    name: string;
    location: string;
  };
}

Yes, this uses storage. So? The cost of this small amount of extra storage is negligible, whereas every fetch of a separate company document is billed as a read. Since this data isn't dynamically changing, it's much more efficient to add it to the appointment document when it is created.

Document fetches should be reserved for dynamic data.
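To make the copying step concrete, here is a small sketch of building the appointment document at creation time. All names (`withCompanySummary`, `AppointmentInput`, the `startsAt`, `customer`, and `billingEmail` fields) are hypothetical; only the `id`, `name`, and `location` fields come from the structure above.

```typescript
interface Company {
  id: string;
  name: string;
  location: string;
  billingEmail?: string; // hypothetical field that appointments do not need
}

// The static subset of company data copied into each appointment.
interface CompanySummary { id: string; name: string; location: string }

// Hypothetical appointment fields, for illustration only.
interface AppointmentInput { startsAt: string; customer: string }
interface Appointment extends AppointmentInput { company: CompanySummary }

// Build the document to write, embedding only the fields an appointment needs.
function withCompanySummary(input: AppointmentInput, company: Company): Appointment {
  const { id, name, location } = company;
  return { ...input, company: { id, name, location } };
}
```

The result can be written as the appointment document, so a later collection group query returns the company name and location without any extra reads. Keeping `id` in the copy leaves a way back to the full company document when more data is needed.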

edited Mar 27 '20 at 17:51

answered Mar 24 '20 at 17:41

LeadDreamer

Company data can change. Would it be a good idea to use this approach anyway, and if company data happens to change, update all appointments with the new data? Seems like any approach I think of is overkill. – Bryan Mar 26 '20 at 00:38
Company data can change - but will the part an *appointment* needs to display change? For example, all you *might* need is company name, contact phone number, and ID (in case you *do* need more data, or to find data to update). And yes, to some degree NoSQL requirements *can* feel like overkill - until you realize the efficiency savings. – LeadDreamer Mar 27 '20 at 01:09

score 0 · Answer 4 · answered Mar 24 '20 at 17:44

Another point: the refPath of a document (available as `ref.path` on a DocumentReference) is a string representing the fully-qualified '/' separated path to the document:

root/topcollection/topdocumentId/nextcollection/nextdocumentId/bottomcollection/bottomdocumentId

...and you can directly parse this string to find collection names and document IDs anywhere up the path to the document. I use this quite a bit as well.
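A minimal sketch of such parsing follows. `parseDocumentPath` and `PathStep` are hypothetical names, and the sketch assumes the path consists only of alternating collection and document segments with no leading `root` segment, for example `companies/acme/appointments/a1`.

```typescript
interface PathStep { collection: string; documentId: string }

// Split a document path into (collection, documentId) pairs, top-down.
function parseDocumentPath(path: string): PathStep[] {
  const segments = path.split("/").filter((s) => s.length > 0);
  // A document path always has an even number of segments.
  if (segments.length === 0 || segments.length % 2 !== 0) {
    throw new Error("not a document path: " + path);
  }
  const steps: PathStep[] = [];
  for (let i = 0; i < segments.length; i += 2) {
    steps.push({
      collection: segments[i] as string,
      documentId: segments[i + 1] as string,
    });
  }
  return steps;
}
```

For an appointment path, the first step then holds the parent company's ID, which is available without reading the company document.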
