7

Event Hubs don't let you store messages longer than 7 (maybe up to 30) days. What is Azure's suggested architecture for PaaS Event Sourcing with these limitations? If it's Event Hub + snapshotting, what happens if we somehow need to rebuild that state? Additionally, is Azure Stream Analytics Event Hubs' answer to KSQL/Spark?

Sreeram Garlapati
randomsolutions

3 Answers

9

Great Question!

Yes, Event Hubs is intended to be used for the Event Sourcing or append-only log pattern. Event Hubs can serve as a source or sink for stream processing and analytics engines like Spark, so it is not their competitor. In general, Event Hubs offers capabilities similar to those of Apache Kafka.

And yes, to rebuild transactions from the append-only log, snapshotting is definitely the recommended approach!

While shaping Event Hubs as a product offering, our considerations for choosing a default value for retentionPeriod were:

  • most of the critical systems create snapshots every few minutes.
  • most of the design patterns around this suggest retaining older snapshots for rebuild

So it was clear that we didn't need an infinite log, and that a time bound of a day would do for most use cases. Hence, we started with a default of 1 day and provided a knob to go up to 7 days.
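As a rough illustration of why a short retention window can be enough, here is a minimal Python sketch of rebuilding state from the latest snapshot plus only the events that arrived after it. The `Event`, `Snapshot`, `take_snapshot`, and `rebuild` names and the account-balance domain are hypothetical, and an in-memory list stands in for the hub; nothing here uses the Event Hubs SDK.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    seq: int       # position in the append-only log (hypothetical field)
    account: str
    amount: int    # positive = deposit, negative = withdrawal


@dataclass
class Snapshot:
    last_seq: int                              # last event folded into this snapshot
    balances: dict = field(default_factory=dict)


def take_snapshot(previous: Snapshot, events) -> Snapshot:
    """Fold every event newer than `previous` into a fresh snapshot."""
    balances = dict(previous.balances)
    last_seq = previous.last_seq
    for event in events:
        if event.seq <= previous.last_seq:
            continue  # already reflected in the previous snapshot
        balances[event.account] = balances.get(event.account, 0) + event.amount
        last_seq = max(last_seq, event.seq)
    return Snapshot(last_seq, balances)


def rebuild(snapshot: Snapshot, retained_log) -> dict:
    """Current state = latest snapshot + events still inside the retention window."""
    return take_snapshot(snapshot, retained_log).balances


log = [Event(1, "a", 100), Event(2, "b", 50), Event(3, "a", -30), Event(4, "b", 20)]
snapshot = take_snapshot(Snapshot(last_seq=0), log[:2])  # periodic snapshot
state = rebuild(snapshot, log[2:])                       # only the tail is needed
```

Because the snapshot records the last sequence number it covers, events 1 and 2 can expire from the log without affecting the rebuild.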

If you think you have a case where you will need to go back more than 7 days to rebuild a snapshot (for example, for debugging, which is generally not a 99% scenario, though designing for and accommodating it is very wise), the recommended approach is to push the data to an archival store.

When our usage metrics showed that many of our customers had one Event Hubs consumer group dedicated to pushing data to an archival store, we wanted to enable this capability out of the box, and so started to offer the Event Hubs Capture feature.
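The "dedicated consumer group that archives everything" pattern can be sketched roughly as follows. This is an in-memory stand-in, not the Capture feature or the Event Hubs SDK: `Archive`, `archive_key`, and `run_archiver` are hypothetical names, and the hour-bucketed key layout is an assumption made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def archive_key(hub: str, partition_id: int, enqueued_at: datetime) -> str:
    """Hypothetical blob-style key: one bucket per hub, partition, and hour."""
    return f"{hub}/{partition_id}/{enqueued_at:%Y/%m/%d/%H}"


class Archive:
    """In-memory stand-in for a long-term store such as blob storage."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def write(self, hub, partition_id, enqueued_at, body):
        self.buckets[archive_key(hub, partition_id, enqueued_at)].append(body)


def run_archiver(hub, events, archive):
    """Stand-in for the dedicated consumer: copy every event to the archive."""
    for partition_id, enqueued_at, body in events:
        archive.write(hub, partition_id, enqueued_at, body)


events = [
    (0, datetime(2018, 11, 23, 15, 16, tzinfo=timezone.utc), "order-created"),
    (0, datetime(2018, 11, 23, 15, 40, tzinfo=timezone.utc), "order-paid"),
    (1, datetime(2018, 11, 23, 16, 1, tzinfo=timezone.utc), "order-shipped"),
]
archive = Archive()
run_archiver("orders", events, archive)
```

Bucketing by partition and time keeps each partition's order intact and makes it cheap to locate the slice needed for a later replay.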

More on Event Hubs.

Sreeram Garlapati
  • 1
    Any examples of this Sreeram? I'd love to see an implementation of Event Sourcing with Event Hubs. – heymega Nov 23 '18 at 15:16
  • If you use the Event Sourcing pattern then you need to store the original events in a permanent store. Snapshotting, when used in the context of Event Sourcing, means creating an application state based on events. Saving the snapshot without the events would mean you lose the audit trail. Additionally, a snapshot is something that a consumer of the events would create based on its usage needs. If the consumer changes, or if you add a new consumer of events, it should be able to recreate a snapshot using the original events. – David May 05 '23 at 06:35
5

Event Hubs are meant for temporarily storing events while moving them between data stores. To keep events for an indefinite period, you would have to load them into some permanent storage, e.g. Cosmos DB.

KSQL is somewhat comparable to Azure Stream Analytics. Spark is a much broader product, but you can use Spark to process Event Hubs data.

P.S. I'm not an official Microsoft spokesperson, so that's just my view.

Mikhail Shilkov
  • Note that Event Hub natively supports long-term message retention in either Azure Blob storage or Azure Data Lake https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-about#support-for-real-time-and-batch-processing – Richiban Oct 24 '19 at 11:46
0

In many cases Cosmos DB is a better event store than Event Hub when using the Event Sourcing pattern. The reason is that Cosmos DB does not automatically delete data. With Event Hub you could use the Capture feature to copy older data to Azure Blob Storage, but then, to access older events, you would need to read from both Event Hub and Blob Storage, which makes the implementation more complex.
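To give a sense of the extra complexity, here is a minimal Python sketch of a replay that stitches the two sources together. It assumes events within a single partition carry an increasing sequence number, that each source is already sorted, and that the archive and the retained log may overlap; the `replay` function and the tuple shape are hypothetical, and plain lists stand in for Blob Storage and the hub.

```python
from itertools import chain


def replay(archived, retained):
    """Yield (seq, body) pairs in order, exactly once, across both stores.

    Assumes both inputs are sorted by seq and that the archive covers the
    older end of the stream, possibly overlapping the retained log.
    """
    last_seq = -1
    for seq, body in chain(archived, retained):
        if seq <= last_seq:
            continue  # duplicate: already delivered from the archive
        last_seq = seq
        yield seq, body


archived = [(0, "created"), (1, "paid"), (2, "packed")]  # older events, from the archive
retained = [(2, "packed"), (3, "shipped")]               # still within hub retention
events = list(replay(archived, retained))
```

The de-duplication step is the part that a single permanent store makes unnecessary.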

In contrast, with Cosmos DB you can store events together with a partition key and a timestamp. You can then easily filter events or use the Cosmos DB change feed to get notified.
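The shape of such a store can be sketched in a few lines of Python. This is an in-memory model only, not the Cosmos DB SDK: the `InMemoryEventStore` class, its method names, and the `pk`/`ts`/`payload` field names are assumptions for illustration. In a real container the `read` method would be a query filtered on the partition key and timestamp.

```python
from collections import defaultdict


class InMemoryEventStore:
    """Hypothetical stand-in for a container of events keyed by partition."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def append(self, partition_key: str, timestamp: int, payload: dict) -> None:
        """Store one event document under its partition key."""
        self._partitions[partition_key].append(
            {"pk": partition_key, "ts": timestamp, "payload": payload}
        )

    def read(self, partition_key: str, since: int = 0) -> list:
        """Return events for one partition with ts >= since, oldest first."""
        events = self._partitions.get(partition_key, [])
        return sorted((e for e in events if e["ts"] >= since), key=lambda e: e["ts"])


store = InMemoryEventStore()
store.append("order-42", 100, {"type": "created"})
store.append("order-42", 200, {"type": "paid"})
store.append("order-7", 150, {"type": "created"})

recent = store.read("order-42", since=150)
```

Using the aggregate or stream identifier as the partition key keeps every event for one entity together, so a replay of that entity touches a single partition.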

However, even with Cosmos DB, a dump of events to Blob Storage may be useful if a full replay of events is needed often and is expensive or slow. For domains with a reasonable number of events, though, using a database as the only store seems best.

David