For the past week or so, we've been experiencing 504, Gateway Timeout
errors while making fetching email messages from the MS Graph API. Prior to that for over a month of running, the same application did not experience that error, at least not in any significant frequency.
We are using V1.0 of the MS Graph API
Our query is fairly simple:
$top=100&$orderBy=lastModifiedDateTime desc&$filter=lastModifiedDateTime lt 2019-09-09T19:27:55Z and parentFolderId ne 'JunkEmail'
We get the timeout for users who have large volumes of data (> 100K email messages), but occasionally do get it for users with lesser (around 18K email messages) volume. Volume has not changed much from the time where the system was working, to now when we see many timeouts.
We've tried simplifying the query, reducing the number of messages we request etc., but that seems to have only limited and intermittent impact.
My question - What can we do to eliminate/significantly reduce the possibility of getting the 504, Gateway Timeout error from the MS Graph API?
I suspect that since we are asking for messages without a folder filter, it may be possible that we are stressing out the query engine. Just a hunch, and if any one has real insight into MS Graph API, i'd love to know if that may be possible. Also, any information that helps us better understand what is going on under the hood would be much appreciated.
Update 1 (2019-09-13 15:44:00 EST) - Here is a visualization of a set of fetch requests made by the app over a 12 hour period (approximately). The pink bars are the number of successful fetches, and the light blue ones are the failed requests (all having 504, Gateway Timeout as the failure code). As you can see, when the app starts it has a number of failures, which eventually reduce and go away. Then from around 4:30AM to 9:30AM, there are a number of failures, which eventually subside. Almost all failures happen while fetching messages for one user, who has a very large mailbox (> 220K messages). I realize this is a small data set, and am happy to generate one that runs for a longer period of time if that helps. Also, the app in question is running on our Azure tenant, as a part of a Azure Function app, in the "East US" location.
Update 2, (16th Sept 2019, 09:32:00 EST) - We ran the system for the last 3 days and here is a visualization of the fetch requests made by the app during that time. The blue bars are successful fetches, and the pink bars are failed fetched (all having 504, Gateway Timeout as the failure code). The summary is that except for a small window 11PM - 2AM on the first night, no request succeeded for this one particular user with a large mailbox. In effect, that means that inspite of retry logic etc., we are unable to process that user's data.