5

For the past week or so, we've been getting 504 Gateway Timeout errors while fetching email messages from the MS Graph API. Before that, the same application ran for over a month without hitting that error, at least not with any significant frequency.

  • We are using V1.0 of the MS Graph API

  • Our query is fairly simple:

$top=100&$orderBy=lastModifiedDateTime desc&$filter=lastModifiedDateTime lt 2019-09-09T19:27:55Z and parentFolderId ne 'JunkEmail'

  • We get the timeout for users who have large volumes of data (> 100K email messages), but occasionally also for users with smaller volumes (around 18K email messages). Volume has not changed much between the time when the system was working and now, when we see many timeouts.

  • We've tried simplifying the query, reducing the number of messages we request etc., but that seems to have only limited and intermittent impact.

My question - What can we do to eliminate/significantly reduce the possibility of getting the 504, Gateway Timeout error from the MS Graph API?

I suspect that since we are asking for messages without a folder filter, we may be stressing the query engine. That is just a hunch, and if anyone has real insight into the MS Graph API, I'd love to know whether that is possible. Also, any information that helps us better understand what is going on under the hood would be much appreciated.

Update 1 (2019-09-13 15:44:00 EST) - Here is a visualization of a set of fetch requests made by the app over a 12 hour period (approximately). The pink bars are the number of successful fetches, and the light blue ones are the failed requests (all having 504, Gateway Timeout as the failure code). As you can see, when the app starts it has a number of failures, which eventually reduce and go away. Then from around 4:30AM to 9:30AM, there are a number of failures, which eventually subside. Almost all failures happen while fetching messages for one user, who has a very large mailbox (> 220K messages). I realize this is a small data set, and am happy to generate one that runs for a longer period of time if that helps. Also, the app in question is running on our Azure tenant, as part of an Azure Function app, in the "East US" location.

Graph of fetch requests over a 12 hour period

Update 2 (16th Sept 2019, 09:32:00 EST) - We ran the system for the last 3 days and here is a visualization of the fetch requests made by the app during that time. The blue bars are successful fetches, and the pink bars are failed fetches (all having 504, Gateway Timeout as the failure code). The summary is that except for a small window from 11PM to 2AM on the first night, no request succeeded for this one particular user with a large mailbox. In effect, that means that in spite of retry logic etc., we are unable to process that user's data.


floatingfrisbee
  • One more noteworthy piece of information is that the timeouts sometimes happen at around 30 seconds and the rest of the time at around 40 seconds. – floatingfrisbee Sep 11 '19 at 15:46
  • An update - we've been tracking the data to see when and how often we get 504, Gateway Timeouts, and a pattern is starting to emerge. Basically we have periods of time when we see a 50-100% failure rate for fetch requests for the user who has > 200K messages in their mailbox. Usually these periods are when we first start processing the mailbox, i.e. after a period of inactivity. Over time, the rate of errors starts to decrease, and eventually we get long periods (hours) with no 504 errors. This is based on running the system over the last couple of days, so not on a huge amount of data. – floatingfrisbee Sep 12 '19 at 20:04

3 Answers

3

Microsoft Graph can be slow at times and will throttle occasionally.

I'd advise you let the Graph SDK do the hard work to save you from writing code to handle all this yourself.

Use the Microsoft Graph client library version 1.17.0+, as it introduced auto retry on 504 errors. It also handles throttling (code 429) when it occurs.

The point I am trying to make is that you can either retry yourself when you get a 504 or 429, or delegate that responsibility to an SDK.
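For anyone handling the retries by hand, here is a minimal sketch of what that could look like. It is not the SDK's implementation. `call_with_retry` and its `send` argument are hypothetical names: `send` is assumed to be a zero-argument callable that performs one request and returns `(status_code, headers, body)`, with `headers` as a plain dict. The attempt count and delays are arbitrary illustrative values.

```python
import random
import time

# Status codes worth retrying in this sketch: throttling and gateway timeout.
RETRYABLE = {429, 504}


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical callable returning (status_code, headers, body).
    Returns the (status_code, body) of the last attempt.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status not in RETRYABLE or attempt == max_attempts:
            return status, body
        # Prefer the server's Retry-After hint (in seconds) when present.
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            delay = float(retry_after)
        else:
            # Otherwise back off exponentially, with a little random jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
        sleep(delay)
```

The `sleep` parameter is only there so the waiting can be swapped out in tests. Note that retrying does not help if every attempt times out, which is the situation described in the question's second update.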

Kalyan Krishna
  • Good to know about the auto retry for 504 in version 1.17.0. That will certainly help, but to be clear, we are not talking about throttling here. From the throttling link - `When throttling occurs, Microsoft Graph returns HTTP status code 429 (Too many requests), and the requests fail.` Throttling does not cause 504 errors. – floatingfrisbee Sep 12 '19 at 03:18
  • Thanks Kalyan, and it is certainly a huge positive to not have to deal with throttling and timeouts. What concerns me is that there are repeated timeouts (though no throttling, since we hit the API very sparingly) for a few users, and that prevents progress on the processing of their data. Is there no solution to prevent or at least significantly reduce the occurrence of the timeouts? – floatingfrisbee Sep 12 '19 at 04:00
  • @floatingfrisbee There isn't any prescriptive guidance on how to reduce the timeouts. Microsoft Graph is a service proxy sitting on top of many other services (which in turn rely on other services). There are many sources for the timeouts. Our best bet is to handle them gracefully. Are you seeing patterns in the timeouts? – Michael Mainer Sep 12 '19 at 21:10
  • @MichaelMainer thanks for your comment. I haven't yet noticed a strong pattern, but have updated my original post with some info that may be relevant. Happy to get you more information if helpful. While I understand that the Graph service is distributed in nature, if we have a significant number of failures for a user, which, in effect, prevents processing of their data in a reasonable time period, we do need a strategy to handle that. – floatingfrisbee Sep 13 '19 at 15:44
  • continued... On the bright side, between retry logic, and some other tweaks we've made, we seem to be in a good place for now, but I am worried about when we hit a larger mailbox that tips the scale. – floatingfrisbee Sep 13 '19 at 15:45
3

Good to hear that the retry is helping. I've got a couple of options to try:

1) Change your query and move the ordering responsibilities to the client. $orderBy=lastModifiedDateTime desc and the filter require indices to be created, and this increases the load on the mailbox. Doing client-side ordering may be better for these large mailboxes.

2) Use delta query (with your filter) to sync and get incremental changes. You will have to add a folder hierarchy sync. You may be able to make parallel calls. I suspect that this will give you much better performance after the initial sync.
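A rough sketch of how the two options could combine for a single folder: page through a delta query by following `@odata.nextLink` until the response carries an `@odata.deltaLink`, then order the collected messages on the client. `sync_folder` and `fetch_page` are hypothetical names; `fetch_page` is assumed to take a URL and return the parsed JSON response as a dict. The starting URL (a per-folder messages delta endpoint with your filter applied) is left to the caller.

```python
def sync_folder(fetch_page, start_url):
    """Collect every page of one delta round, newest message first.

    fetch_page is a hypothetical callable: url -> parsed JSON dict.
    Returns (messages, delta_link); delta_link is the URL to call on the
    next sync to receive only the changes since this one.
    """
    messages = []
    url = start_url
    while True:
        page = fetch_page(url)
        messages.extend(page.get("value", []))
        next_link = page.get("@odata.nextLink")
        if next_link is None:
            delta_link = page.get("@odata.deltaLink")
            break
        url = next_link
    # Client-side ordering: UTC timestamps in the same ISO 8601 format
    # sort correctly as plain strings.
    messages.sort(key=lambda m: m["lastModifiedDateTime"], reverse=True)
    return messages, delta_link
```

This holds a whole folder's result set in memory, so for very large folders you may prefer to process each page as it arrives and give up strict recency order. Each page fetch is also a natural place to apply retry logic.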

Michael Mainer
  • I may have spoken too soon. We had an instance run for the last three days, and had some pretty disappointing results. If you can, please take a look at the original post, I have made another update to it with more info. – floatingfrisbee Sep 16 '19 at 13:43
  • Re: point 1, about moving ordering to the client. I am not sure it will have the same semantics. What we are interested in is going backward in time one batch at a time. So if there are 100K messages with a last modified date older than, say, "2019-09-15 15:30:00", I want to get the 100 most recent among them, process them, then get the next 100 most recent, and so on. How do I make this work with sorting on the client side? Or are you saying that we relax the requirement of processing them in recency order (which may be ok, if that really helps solve the issue)? – floatingfrisbee Sep 16 '19 at 13:44
  • Re: point 2, about using delta tokens. We had originally started with that approach, but since it requires queries to be made per folder, we abandoned it. It increases the complexity of the code, and we'll certainly go there if that is our only option. However, I did expect that the mail endpoint offered by the Graph API would work for the most part. – floatingfrisbee Sep 16 '19 at 13:45
2

I encountered the same issue: a 504 error while trying to get all messages. After a thorough inspection, I figured out that in our case the problem was draft items. In some cases they were throwing errors. After adding the filter "isDraft eq false", the 504s stopped and we're getting all messages. It turns out that some drafts are broken. They won't show up in OWA or Outlook, and in our case the one that was messing with the query was stored under a parentFolderId that was non-existent, which is a huge problem in and of itself in my opinion.
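To illustrate, this is roughly how the query from the question would look with the extra draft clause, built with Python's standard library. `build_messages_query` is a hypothetical helper, and whether excluding drafts fixes the timeouts for a given mailbox is something to verify rather than assume.

```python
from urllib.parse import quote, urlencode


def build_messages_query(before, top=100, exclude_drafts=True):
    """Build the OData query string from the question, optionally skipping drafts.

    before is an ISO 8601 UTC timestamp such as "2019-09-09T19:27:55Z".
    """
    clauses = [
        f"lastModifiedDateTime lt {before}",
        "parentFolderId ne 'JunkEmail'",
    ]
    if exclude_drafts:
        clauses.append("isDraft eq false")
    params = {
        "$top": top,
        "$orderBy": "lastModifiedDateTime desc",
        "$filter": " and ".join(clauses),
    }
    # quote_via=quote encodes spaces as %20; "$" and "'" are left readable.
    return urlencode(params, quote_via=quote, safe="$'")
```

Calling `build_messages_query("2019-09-09T19:27:55Z")` gives a string to append after `?` on the messages endpoint URL.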

Thorgal