
This question concerns concurrent access to saga data when saga data is persisted in Azure Table Storage. It also references information found in Particular's documentation: http://docs.particular.net/nservicebus/nservicebus-sagas-and-concurrency

We've noticed that, within a single saga executing handlers concurrently, modifications to saga data appear to follow a "last one to post changes to Azure Table Storage wins" pattern. Is this the intended behavior when using NSB in conjunction with Azure Table Storage as the saga data persistence layer?

Example:

  1. Integer property in Saga Data, assume it currently = 5
  2. 5 commands are handled by 5 instances of the same handler in this saga
  3. Each command handler decrements the integer property in saga data
  4. The final value of the integer property in saga data could actually be 4 after handling these 5 messages. If each message is handled by a new instance of the saga, potentially on different servers, each instance reads a copy of saga data where the integer is 5, decrements it to 4, and posts it back up. That is the extreme fully-concurrent case; more generally, the integer will likely end up greater than 0 if any of the 5 messages are handled concurrently, and it only reaches 0 when the 5 commands happen to execute serially. (A minimal sketch of such a handler follows below.)
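
To make the scenario concrete, here is a minimal sketch of the kind of handler described above, assuming NServiceBus 5.x-style saga APIs. The type and property names (CountdownSaga, CountdownData, DecrementCommand, OrderId) are hypothetical and not taken from our actual solution:

    using System;
    using NServiceBus;
    using NServiceBus.Saga;

    // Hypothetical command; one of the 5 messages in the example above.
    public class DecrementCommand : ICommand
    {
        public Guid OrderId { get; set; }
    }

    // Saga data with the integer property from the example (starts at 5).
    public class CountdownData : ContainSagaData
    {
        [Unique]
        public Guid OrderId { get; set; }

        public int Counter { get; set; }
    }

    public class CountdownSaga : Saga<CountdownData>,
                                 IHandleMessages<DecrementCommand>
    {
        public void Handle(DecrementCommand message)
        {
            // Each concurrently executing handler loads its own copy of the
            // saga data. If two handlers both read Counter == 5, both write
            // back 4 and one decrement is lost unless the persister rejects
            // the stale write.
            Data.Counter--;

            if (Data.Counter == 0)
            {
                MarkAsComplete();
            }
        }

        protected override void ConfigureHowToFindSaga(SagaPropertyMapper<CountdownData> mapper)
        {
            mapper.ConfigureMapping<DecrementCommand>(m => m.OrderId)
                  .ToSaga(s => s.OrderId);
        }
    }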

Also, as Azure Table Storage supports optimistic concurrency, is it possible to enable this feature for Table Storage just as it is enabled for RavenDB when Raven is used as the persistence technology?
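
For context, Table Storage's optimistic concurrency works through ETags: a Replace operation sends an If-Match header with the ETag read earlier, and the service rejects the write with HTTP 412 ("UpdateConditionNotSatisfied") if the row changed in the meantime. Below is a minimal standalone sketch of that mechanism using the Microsoft.WindowsAzure.Storage client of that era; it is only an illustration with made-up names (CounterEntity, Decrement), not the NServiceBus persister's code:

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    // Hypothetical entity standing in for a persisted saga-data row.
    public class CounterEntity : TableEntity
    {
        public int Counter { get; set; }
    }

    public static class EtagExample
    {
        public static void Decrement(CloudTable table, string partitionKey, string rowKey)
        {
            // The retrieved entity carries the ETag of the version we read.
            var retrieve = TableOperation.Retrieve<CounterEntity>(partitionKey, rowKey);
            var entity = (CounterEntity)table.Execute(retrieve).Result;

            entity.Counter--;

            try
            {
                // Replace sends If-Match with the entity's ETag; if another
                // writer changed the row after our read, the service refuses
                // the update.
                table.Execute(TableOperation.Replace(entity));
            }
            catch (StorageException ex)
            {
                if (ex.RequestInformation.HttpStatusCode == 412)
                {
                    // Optimistic concurrency violation ("UpdateConditionNotSatisfied"):
                    // the correct reaction is to re-read and reapply the change.
                }
                throw;
            }
        }
    }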

If this is not possible, what is the recommended approach for handling this? Currently we follow the rule that any handler in a saga that could ever handle multiple messages concurrently is not allowed to modify saga data, which means our coordination of saga messages is accomplished via means external to the saga rather than through saga data as we'd initially intended.

MSC

2 Answers


After working with Particular support, the symptoms described above turned out to be a defect in NServiceBus.Azure. The issue has been patched by Particular in NServiceBus.Azure 5.3.11 and 6.2+. I can personally confirm that updating to 5.3.11 resolved our issues.

For reference, a tell-tale sign of this issue is the following exception being thrown and never handled:

Failed to process message Microsoft.WindowsAzure.Storage.StorageException: Unexpected response code for operation : 0

The details of the exception will indicate "UpdateConditionNotSatisfied", referring to the optimistic concurrency check.
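
For anyone hitting this, a small helper along these lines (a sketch, not the persister's own code; the name IsUpdateConditionNotSatisfied is made up) can distinguish that concurrency failure from other storage errors when inspecting the exception:

    using Microsoft.WindowsAzure.Storage;

    public static class StorageErrors
    {
        // Returns true when a StorageException represents a failed optimistic
        // concurrency check on the table service (HTTP 412 Precondition Failed,
        // reported as "UpdateConditionNotSatisfied").
        public static bool IsUpdateConditionNotSatisfied(StorageException ex)
        {
            var info = ex.RequestInformation;
            return info != null
                && info.HttpStatusCode == 412
                && info.ExtendedErrorInformation != null
                && info.ExtendedErrorInformation.ErrorCode == "UpdateConditionNotSatisfied";
        }
    }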

Thanks to Yves Goeleven and Sean Feldman from Particular for diagnosing and resolving this issue.

MSC
  • We are on 5.3.11 and after about a week running on Azure storage queues we got 6K error messages. Continuous failures with Microsoft.WindowsAzure.Storage.StorageException: Unexpected response code for operation : 0 – Alexey Zimarev May 06 '15 at 13:03

The Azure saga storage persister uses optimistic concurrency. If multiple messages arrive at the same time, the last one to update should throw an exception, retry, and make the data correct again.
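
For illustration, here is a minimal sketch of that read-retry pattern in isolation, assuming a hypothetical SagaCounterEntity row and the Microsoft.WindowsAzure.Storage client; this is not the persister's actual implementation:

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    // Hypothetical table row standing in for persisted saga data.
    public class SagaCounterEntity : TableEntity
    {
        public int Counter { get; set; }
    }

    public static class RetryOnConflict
    {
        public static void Decrement(CloudTable table, string pk, string rk, int maxAttempts = 5)
        {
            for (var attempt = 1; attempt <= maxAttempts; attempt++)
            {
                // Re-read on every attempt so the decrement is applied to the
                // latest version, not a stale copy.
                var entity = (SagaCounterEntity)table.Execute(
                    TableOperation.Retrieve<SagaCounterEntity>(pk, rk)).Result;

                entity.Counter--;

                try
                {
                    table.Execute(TableOperation.Replace(entity)); // If-Match on the ETag
                    return; // the write was accepted against the version we read
                }
                catch (StorageException ex)
                {
                    if (ex.RequestInformation.HttpStatusCode != 412 || attempt == maxAttempts)
                    {
                        throw; // not a concurrency conflict, or retries exhausted
                    }
                    // 412: another writer won the race; loop around and retry.
                }
            }
        }
    }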

So this sounds like a bug. Can you share which version you're on?

PS: last year we resolved an issue that sounds very similar to this: https://github.com/Particular/NServiceBus.Azure/issues/124. It has been fixed in NServiceBus.Azure 5.2 and upwards.

Yves Goeleven
  • We've run into this issue in NServiceBus version 4.6.4. Issue 124 could definitely be the root cause of the symptoms I described in the original question. Would you recommend we upgrade our solution to depend on NSB 5.2 as a result? – MSC Feb 13 '15 at 14:00
  • NServiceBus.Azure 5.2.0 should work together with NServiceBus 4.6.4, so upgrading only the Azure packages should already do the trick. – Yves Goeleven Feb 13 '15 at 14:11
  • Wanted to provide additional version information: We are also using NServiceBus.Azure version 6.2. This leads me to believe we are already using a version that contains the fix for issue 124 - especially given the issue is over a year old. – MSC Feb 13 '15 at 15:03
  • Yes that fix should be in 6.2 as well, strange... I'll have to look into it. Just for completeness sake, which azure storage SDK version are you using? – Yves Goeleven Feb 14 '15 at 19:07
  • I have created an issue for this and will try to figure out what is going on: https://github.com/Particular/NServiceBus.Azure/issues/248. But according to the Azure storage documentation and sources https://github.com/Azure/azure-storage-net/blob/master/Lib/ClassLibraryCommon/Table/TableOperation.cs#L100, the replace operation that we leverage internally should perform optimistic concurrency on our behalf... – Yves Goeleven Feb 16 '15 at 10:19
  • We are using the Azure 2.3 SDK - and the project in question uses storage library Microsoft.WindowsAzure.Storage v4.1. We noticed in testing that the current implementation appears to attempt to support optimistic concurrency, as we logged a couple of the command responses getting handled multiple times (and the only operation the handler was doing was decrementing a counter in saga data). This led us to assume the current implementation supports optimistic concurrency, but that there may be a defect. I will PM you a screenshot of our logs depicting this. – MSC Feb 16 '15 at 13:33
  • I have just finished testing this using Azure SDK 4.3, and it works flawlessly: concurrency is checked, an exception is thrown, and the operation is retried. Can you share your reproduction with me? – Yves Goeleven Feb 16 '15 at 13:50
  • I have built a simple test project, containing all the same dependencies as our actual solution, which exhibits the symptoms described in this post. How can I provide this solution to you? Can I open up a support ticket with Particular? I could also zip up the solution (minus packages), host them for you to download, you’d just have to use Nuget to restore the packages. – MSC Feb 16 '15 at 21:09
  • either is fine, or just mail me at yves ad goeleven dot com – Yves Goeleven Feb 16 '15 at 21:32
  • I reviewed the test project above; the concurrency control is working as expected, and the test is just not waiting long enough for the retries to finish (a retry will only occur after 30 seconds, when the visibility timeout of the message expires). – Yves Goeleven Feb 17 '15 at 08:16
  • I have sent some additional information via a follow-up email. Unfortunately, I was unable to attribute the issue described above to Azure Service Bus lock timeouts. – MSC Feb 17 '15 at 16:30