Aeron cluster gets blocked when taking a big snapshot

Question

I've been experimenting with Aeron Cluster, and one thing that is unclear to me is how you deal with applications where nodes hold tens of gigabytes of state. This state is in memory and is accumulated by replaying the events.

However, if I initiate a snapshot (which I can only do on the leader), it will obviously block, since you can't keep applying events and take a snapshot at the same time. For latency-critical apps you obviously can't wait for seconds while the snapshot is taken.

One solution that comes to mind is that a follower takes the snapshot and, when it's done, catches up with the leader and then takes over. Once the snapshot is taken and the log is in the right state, you know your snapshot is valid. This way you have seconds to take your snapshot.
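To make the follower-snapshot idea concrete, here is a minimal toy sketch in plain Java. It does not use any Aeron API: `Event`, `Node`, `Snapshot`, `takeSnapshot` and `restore` are all hypothetical names invented for illustration. It only shows the invariant the idea relies on: a snapshot tagged with the log position it was taken at, plus the log entries after that position, reproduces the leader's state.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types are Aeron APIs.
final class FollowerSnapshotSketch {

    /** One replicated log entry: add `delta` to the value stored under `key`. */
    static final class Event {
        final String key;
        final long delta;
        Event(String key, long delta) { this.key = key; this.delta = delta; }
    }

    /** A node whose in-memory state is built purely by replaying the log. */
    static final class Node {
        final Map<String, Long> state = new HashMap<>();
        int appliedPosition = 0; // number of log entries applied so far

        /** Apply log entries in order until `position` entries have been applied. */
        void applyUpTo(List<Event> log, int position) {
            while (appliedPosition < position) {
                Event e = log.get(appliedPosition++);
                state.merge(e.key, e.delta, Long::sum);
            }
        }
    }

    /** A snapshot is a copy of the state plus the log position it corresponds to. */
    static final class Snapshot {
        final int position;
        final Map<String, Long> state;
        Snapshot(int position, Map<String, Long> state) {
            this.position = position;
            this.state = state;
        }
    }

    /** The follower is not applying events while this runs; the leader is unaffected. */
    static Snapshot takeSnapshot(Node follower) {
        return new Snapshot(follower.appliedPosition, new HashMap<>(follower.state));
    }

    /** Rebuild a node from a snapshot, then replay the log tail after the snapshot position. */
    static Node restore(Snapshot snapshot, List<Event> log, int position) {
        Node node = new Node();
        node.state.putAll(snapshot.state);
        node.appliedPosition = snapshot.position;
        node.applyUpTo(log, position);
        return node;
    }

    public static void main(String[] args) {
        List<Event> log = new java.util.ArrayList<>();
        Node leader = new Node();
        Node follower = new Node();

        for (int i = 0; i < 5; i++) log.add(new Event("acct-" + (i % 2), i));
        leader.applyUpTo(log, log.size());
        follower.applyUpTo(log, log.size());

        // The follower pauses and snapshots; the leader keeps processing new events.
        Snapshot snapshot = takeSnapshot(follower);
        for (int i = 5; i < 9; i++) log.add(new Event("acct-" + (i % 2), i));
        leader.applyUpTo(log, log.size());

        // The follower catches up from the log once the snapshot is done.
        follower.applyUpTo(log, log.size());

        Node recovered = restore(snapshot, log, log.size());
        System.out.println("snapshot position: " + snapshot.position);
        System.out.println("recovered matches leader: " + recovered.state.equals(leader.state));
    }
}
```

In a real cluster the hard parts are the ones this sketch skips: agreeing across nodes on which position the snapshot belongs to, and keeping enough of the log available for the follower to catch up.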

Alternatively, when the leader is about to take a snapshot, it hands leadership over to the most up-to-date follower, takes the snapshot, and then, if needed, takes leadership back. Either way, your clients are not blocked.

Am I doing something wrong, or misunderstanding the snapshots?

There is not much info on this amazing library. At least I couldn't find an answer to this.

score 3 · Accepted Answer · answered Dec 02 '21 at 02:50


There is an open issue on this feature: https://github.com/real-logic/aeron/issues/1263


Michael Barker


Thanks for adding this, though I have a question: how come this was not an issue for other systems? Almost any reasonably complex system will have gigabytes of in-memory state, and taking a snapshot of that will take significant time. Given that Aeron is very popular among traders and exchanges, how do they work around this? – vach Dec 02 '21 at 16:27

There are a number of approaches, and they vary by use case: keeping snapshots smaller by being strict about what needs to be kept in the working set; using off-heap techniques to make very quick copies of the state; accepting some downtime or high latency at specific times... – Michael Barker Dec 21 '21 at 01:10
True, all of those are valid. However, this problem does not need a compromise: you can have huge applications with gigabytes of memory in the working set and still have zero downtime by simply taking the snapshot on one of the followers (leaving the others up to date). When the snapshot is done, that follower catches up with the leader, and the snapshot can be transmitted at low priority so it does not interfere with main event processing. It is a problem that can be solved in a way that does not force one to compromise. – vach Dec 29 '21 at 17:07
Looks like the issue and associated PR have been closed without merging. Could we have a status on this? – pcdv Jul 25 '22 at 09:33
