I'm working with the Microsoft Graph API to process events coming from an external client/system. Changes happen in that system, events get sent to a queue, and the code I'm working on fetches those events from the queue, groups them into batches of 20, and executes downstream actions in the Graph ecosystem. These actions could be anything, really, but the simplest scenario I'm working on is updating a field value in a SharePoint list item.
So, again: a change happens in the external system and goes to a queue, and my code consumes events from that queue, batching them into batches of 20 requests, each one being a POST to update a SharePoint list item's field.
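For context, this is roughly how my consumer builds each batch. It's a minimal sketch, assuming Graph's JSON batching endpoint (a single POST to `/$batch` whose body wraps up to 20 sub-requests); the field updates themselves are `PATCH` calls against the item's `/fields` resource, and the `site_id`/`list_id` names are placeholders:

```python
GRAPH_BATCH_LIMIT = 20  # Graph JSON batching allows at most 20 requests per $batch call

def build_field_update_batch(site_id, list_id, updates):
    """Build a Graph $batch payload from a list of (item_id, fields) tuples.

    The returned dict is the body of a single POST to
    https://graph.microsoft.com/v1.0/$batch.
    """
    if len(updates) > GRAPH_BATCH_LIMIT:
        raise ValueError(f"at most {GRAPH_BATCH_LIMIT} requests per batch")
    return {
        "requests": [
            {
                "id": str(i),  # ids must be unique within the batch
                "method": "PATCH",
                "url": f"/sites/{site_id}/lists/{list_id}/items/{item_id}/fields",
                "headers": {"Content-Type": "application/json"},
                "body": fields,
            }
            for i, (item_id, fields) in enumerate(updates, start=1)
        ]
    }
```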
Technically, you could say that everything is working just fine. I have taken the batching/throttling guidance into account (Microsoft Graph throttling guidance), so the code handles 429 responses by waiting for MAX(RETRY_AFTER), where RETRY_AFTER is the value of that response header for each individual request in the batch. After waiting, the code retries only the failed requests in that batch.
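The retry logic I described boils down to something like this sketch: walk the `$batch` response, separate succeeded from throttled sub-responses, and take the largest `Retry-After` as the wait time (the 10-second fallback below is my own assumption for when the header is missing, not something from the guidance):

```python
def split_batch_response(batch_response):
    """Split a Graph $batch response body into succeeded ids, throttled ids,
    and the longest Retry-After (seconds) among the throttled sub-responses."""
    succeeded, throttled, retry_after = [], [], 0
    for sub in batch_response.get("responses", []):
        if sub["status"] == 429:
            throttled.append(sub["id"])
            # Retry-After comes back per sub-response; default conservatively
            # (assumed fallback of 10s) if the header is absent.
            headers = sub.get("headers", {})
            retry_after = max(retry_after, int(headers.get("Retry-After", 10)))
        elif 200 <= sub["status"] < 300:
            succeeded.append(sub["id"])
    return succeeded, throttled, retry_after
```

The caller then sleeps for `retry_after` seconds and re-submits a new batch containing only the throttled ids.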
This is a very common scenario in many event-driven systems. When events are generated at a low rate, everything works just fine. Even if a couple of thousand events get queued up, catching up doesn't take too long. However, when the producer rate spikes, or, say, some tens of thousands of events get queued up, throughput drops significantly because the number of throttled requests starts to increase.
I've also taken the necessary steps to prevent the consumer side of the solution from scaling up to the point where the number of parallel Graph requests could overwhelm the Graph API: I'm working with a maximum of 5 parallel consumers/batches.
And I've also read about the request limits for Graph (Microsoft Graph service-specific throttling limits) and SharePoint (Avoid getting throttled or blocked in SharePoint Online), as well as the five-year-old post Microsoft Graph API - Throttling.
What I'm seeing in my tests is that the consumer rate drops to as low as 10 requests per second because of throttling.
My question is:
Is there anything else that could increase throughput when applying changes through Graph? Or should ~10 req/sec be considered close to the maximum, and that's it?
PS: I'm looking for something that would improve the solution's ability to handle spikes. Ideally it wouldn't be anything like doubling, tripling, or multiplying infrastructure or licenses by some factor X. Ideally it would be something like scaling the system up for a short period of time (minutes) and then scaling it back down.