How can we implement an efficient queue using Azure serverless technologies (e.g. Azure Service Bus) to call the Azure OpenAI service concurrently while guaranteeing that earlier messages are processed first?
The complexity is that the rate limit is not X requests per minute over a rolling window. Instead it is measured in tokens per minute (TPM), and Azure enforces it with a one-minute timer whose reset moment we don't know. The rate limit policy is explained here: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/quota#understanding-rate-limits
Assuming the following "queue" and a rate limit of 10,000 TPM:
- Request 1) 2000 expected tokens
- Request 2) 5000 expected tokens
- Request 3) 5000 expected tokens
- Request 4) 2000 expected tokens
- Request 5) 7000 expected tokens
We would like the 'queue' to process requests 1 and 2 concurrently, realize that request 3 would overshoot the token limit, schedule one minute of waiting, then take on requests 3 and 4 concurrently, schedule another minute of waiting, and finally process request 5.
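To make the intended behavior concrete, here is a minimal in-process sketch of that scheduling logic (not a full Service Bus solution). It assumes a hypothetical `call_openai` coroutine and that we can estimate each request's token cost up front:

```python
import asyncio

TPM_LIMIT = 10_000  # tokens per minute granted to the deployment


async def process_fifo(requests, call_openai):
    """Strict FIFO: dispatch requests concurrently until the next one would
    overshoot the minute's token budget, then drain the batch, wait a
    minute, and continue. `requests` is a list of
    (expected_tokens, payload) pairs in queue order."""
    budget = TPM_LIMIT
    in_flight = []
    for expected_tokens, payload in requests:
        if expected_tokens > budget:
            await asyncio.gather(*in_flight)  # finish the current batch
            in_flight = []
            await asyncio.sleep(60)  # we don't know Azure's real reset moment
            budget = TPM_LIMIT
        budget -= expected_tokens
        in_flight.append(asyncio.create_task(call_openai(payload)))
    await asyncio.gather(*in_flight)
```

For the example queue above this dispatches requests 1 and 2 together, waits a minute, dispatches 3 and 4 together, waits a minute, then dispatches 5.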
In theory we don't need to 'schedule' at all and can just hit the rate limit and rely on a retry policy. That may even be preferable, since we know neither the moment the timer resets nor the exact token count Azure will estimate for each request. But in that case, how do we make sure we don't end up with a race condition where requests 3, 4, and 5 all fail and retry, and 5 gets through before 3?
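One way to avoid that race, sketched below under assumptions (a hypothetical `RateLimitError` standing in for the client library's 429 exception, and an `attempt_fn` wrapping the actual API call), is to give every message a sequence number and only let a request attempt a call while no earlier-numbered request is stuck retrying:

```python
import asyncio
import random


class RateLimitError(Exception):
    """Stand-in for the 429 error raised by your client library."""


class OrderedRetrier:
    """Run requests concurrently, but serialize retries in queue order
    so request 5 cannot be accepted while request 3 is still retrying."""

    def __init__(self) -> None:
        self._failed: set[int] = set()  # sequence numbers stuck in retry
        self._cond = asyncio.Condition()

    async def call(self, seq: int, attempt_fn, max_attempts: int = 8):
        for attempt in range(max_attempts):
            async with self._cond:
                # A request may only (re)attempt once no earlier-numbered
                # request is stuck retrying.
                await self._cond.wait_for(
                    lambda: all(f >= seq for f in self._failed))
            try:
                result = await attempt_fn()
            except RateLimitError:
                async with self._cond:
                    self._failed.add(seq)
                    self._cond.notify_all()
                # Exponential backoff with jitter, capped at one window.
                await asyncio.sleep(min(60, 2 ** attempt) + random.random())
                continue
            async with self._cond:
                self._failed.discard(seq)
                self._cond.notify_all()
            return result
        async with self._cond:
            self._failed.discard(seq)  # don't deadlock the queue on give-up
            self._cond.notify_all()
        raise RuntimeError(f"request {seq} gave up after repeated 429s")
```

The cost is that a single request stuck in retry briefly blocks everything behind it, but that head-of-line blocking is exactly the ordering guarantee we want here.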
In theory an even more intelligent solution would process 1, 2, and 4 in parallel, wait a minute and then process 3, wait another minute and then process 5. Here request 4 is allowed to go before request 3 only because it fits within the current minute's budget, which would otherwise be 'wasted'.
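That smarter variant is essentially greedy bin-packing of the queue into one-minute windows. A small planning sketch (plain Python, using the same client-side token estimates) that produces exactly the [1, 2, 4], [3], [5] schedule for the example above:

```python
from collections import deque


def plan_batches(token_costs, tpm_limit=10_000):
    """Greedy minute-by-minute planner: fill each one-minute window in FIFO
    order, letting later requests jump ahead only when they fit in budget
    that would otherwise go unused. For [2000, 5000, 5000, 2000, 7000] and
    a 10,000 TPM limit this returns [[1, 2, 4], [3], [5]] (1-based)."""
    pending = deque(enumerate(token_costs, start=1))
    batches = []
    while pending:
        budget = tpm_limit
        batch = []
        skipped = deque()
        while pending:
            seq, cost = pending.popleft()
            if cost <= budget:
                batch.append(seq)
                budget -= cost
            else:
                skipped.append((seq, cost))  # defer to a later window
        batches.append(batch)
        pending = skipped
    return batches
```

Because skipped requests keep their original order and are tried first in the next window, a large request is delayed by at most one window rather than being starved by a stream of smaller ones behind it.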