
I have an Azure Function with an Event Hub trigger. This hub receives messages from devices and stores them in Blob storage. Recently, I noticed that duplicate messages were being stored in blob. Files in the blob store are ordered by last modified date, and if you look at the screenshot, you can see that's not the case. Has anyone seen this issue before?

I also have an Azure Function that writes to Cosmos DB, and for the duplicate messages in blob there is no corresponding duplicate message in Cosmos.

I have also hooked up Time Series Insights, which doesn't have any duplicate messages either.

I turned on Event Hub Capture, and there are no duplicate messages there either.

Here's the screenshot.

[screenshot: list of blob files for the device, showing duplicate entries]

The first column is the Unix timestamp of the enqueued time at the Event Hub. If I didn't have the GUID appended to the filename, it would have thrown an exception. Here's the snippet that stores the data in blob:

dynamic msg = JObject.Parse(myEventHubMessage);
string deviceId = msg.deviceId;

if (deviceId == "5Y.....")
{
    // Blob name: "_" + enqueued-time Unix timestamp + "_" + a new GUID.
    var filename = "_" + ((DateTimeOffset)enqueuedTimeUtc).ToUnixTimeSeconds() + "_" + Guid.NewGuid().ToString() + ".json";

    var containerName = "containerName/";
    var path = containerName + deviceId + "/" + filename;

    // Write the raw event payload to the blob via an imperative binding.
    using (var writer = binder.Bind<TextWriter>(new BlobAttribute(path)))
    {
        writer.Write(myEventHubMessage);
    }
}

The logic here is very simple: when an event arrives at the Event Hub, the function is triggered and stores the data in Azure Blob storage.

MAK

1 Answer


An important call-out is that Event Hubs has an at-least-once delivery guarantee; it is highly recommended to ensure that your processing is resilient to event duplication in whatever way is appropriate for your application scenarios.
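
To make that concrete for the blob-writing scenario in the question, one option is to make the write itself idempotent: derive the blob name from metadata that is identical on every delivery of the same event (partition id plus sequence number) instead of Guid.NewGuid(), so a redelivered event just overwrites the blob it already produced. The following is only a rough sketch, not the questioner's actual function: the function name, hub name, connection setting, and the enqueuedTimeUtc / sequenceNumber / partitionContext trigger-metadata parameters are assumptions that depend on the Event Hubs extension version in use.

    // Sketch only, not the questioner's actual code. Assumes a 4.x-style Event Hubs
    // extension where enqueuedTimeUtc, sequenceNumber and partitionContext are
    // available as trigger metadata; names and the connection setting are placeholders.
    using System;
    using System.IO;
    using System.Threading.Tasks;
    using Microsoft.Azure.EventHubs.Processor;
    using Microsoft.Azure.WebJobs;
    using Newtonsoft.Json.Linq;

    public static class StoreTelemetry
    {
        [FunctionName("StoreTelemetry")]
        public static async Task Run(
            [EventHubTrigger("hub-name", Connection = "EventHubConnection")] string myEventHubMessage,
            DateTime enqueuedTimeUtc,
            long sequenceNumber,
            PartitionContext partitionContext,
            IBinder binder)
        {
            dynamic msg = JObject.Parse(myEventHubMessage);
            string deviceId = msg.deviceId;

            // Deterministic name: the same event always maps to the same blob, so a
            // redelivery overwrites the existing file instead of adding a new one.
            var filename = "_" + ((DateTimeOffset)enqueuedTimeUtc).ToUnixTimeSeconds()
                         + "_" + partitionContext.PartitionId
                         + "_" + sequenceNumber + ".json";

            var path = "containerName/" + deviceId + "/" + filename;

            using (var writer = await binder.BindAsync<TextWriter>(new BlobAttribute(path)))
            {
                await writer.WriteAsync(myEventHubMessage);
            }
        }
    }

The trade-off versus the GUID-based name is that you give up per-delivery uniqueness, which is exactly the point: the same logical event can only ever land in one blob.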

With respect to the duplication that you're seeing in this case, the binding for Azure Functions makes use of the EventProcessorHost to read events and trigger execution of the function code. As Azure Functions automatically scales up and down, instances of the EventProcessorHost join and leave the consumer group responsible for processing the configured Event Hub.

When a processor is starting up, it will attempt to balance the work for processing with other processors active for the same consumer group. In the case where a processor isn’t able to reach its fair share of the work by claiming unowned partitions, it will attempt to steal ownership of partitions from other processors. During this time, the new owner will begin reading from the last recorded checkpoint. At the same time, the old owner may be dispatching the events that it last read to the handler for processing; it will not understand that ownership has changed until it attempts to read the next set of events from the Event Hubs service. A similar pattern takes place when a processor shuts down and surrenders its partition ownership.

As a result, you will see some duplicate events being processed when processors are started or stopped which will subside when the processors have reached a stable state with respect to load balancing. The duration of that window should be short, but does differ depending on the configuration of the processor and checkpointing strategy being used.

Jesse Squire
  • could you say a word or two - what is the solution to avoid duplicates? (de-duplication) – dee zg Nov 04 '21 at 20:43
  • 2
    The consumer tracks events that it has processed and skips duplicates. Typically this is done for a limited period (think: "last X messages" or "within the last X minutes"). It's difficult to generalize, as it's going to vary quite a bit by the application scenario. – Jesse Squire Nov 05 '21 at 13:19
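
As a rough illustration of the de-duplication idea described in the comment above, here is a minimal, in-memory, time-bounded cache of recently seen identifiers. The RecentEventCache type and the messageId key are hypothetical (for example partitionId plus sequenceNumber, or a GUID the device stamps into the payload); because it lives in a single function instance's memory it will not catch duplicates handed to a different instance during rebalancing, so a shared check (such as testing whether a deterministically named blob already exists) is still needed for that case.

    using System;
    using System.Collections.Concurrent;

    // Hypothetical helper, sketch only: a time-bounded "recently seen" cache that a
    // consumer can consult before processing an event. Event Hubs itself does not
    // de-duplicate, so messageId is an application-level identifier such as
    // partitionId + "-" + sequenceNumber or an id carried in the payload.
    public sealed class RecentEventCache
    {
        private readonly ConcurrentDictionary<string, DateTimeOffset> _seen =
            new ConcurrentDictionary<string, DateTimeOffset>();
        private readonly TimeSpan _window;

        public RecentEventCache(TimeSpan window)
        {
            _window = window;
        }

        // Returns true if the id has not been seen within the window (process it),
        // false if it is a duplicate (skip it).
        public bool TryMarkProcessed(string messageId)
        {
            var now = DateTimeOffset.UtcNow;

            // Evict expired entries so the cache stays bounded to the window.
            foreach (var entry in _seen)
            {
                if (now - entry.Value > _window)
                {
                    _seen.TryRemove(entry.Key, out _);
                }
            }

            // TryAdd is atomic: only the first delivery of a given id returns true.
            return _seen.TryAdd(messageId, now);
        }
    }

    // Usage inside the function body, before writing to blob:
    //
    //     private static readonly RecentEventCache Cache = new RecentEventCache(TimeSpan.FromMinutes(5));
    //     ...
    //     if (!Cache.TryMarkProcessed(partitionId + "-" + sequenceNumber))
    //     {
    //         return; // duplicate within the window; skip it
    //     }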