Apache Flink Resource Planning best practices

Question

I'm looking for recommendations and best practices for determining the optimal resources required to deploy a streaming job on a Flink cluster.

The resources in question are:

Number of task slots per TaskManager
Optimal memory allocation per TaskManager
Max Parallelism

score 1 · Answer 1 · answered Jul 30 '20 at 11:34

This blog post gives some ideas on how to size a deployment. It's meant for moving a Flink application that is under development into production.

I'm not aware of a resource that helps to size before that, as the topology of the job has a tremendous impact. So you'd usually start with a PoC and low data volume and then extrapolate your findings.

Memory settings are described in the Flink docs. Make sure to use the page that matches your Flink version, as the memory configuration changed recently.

score 1 · Answer 2 · answered Jul 30 '20 at 12:09

Number of task slots per Task Manager

One slot per TM is a rough rule of thumb as a starting point, but you probably want to keep the number of TMs under 100 or so. This is because the Checkpoint Coordinator will eventually struggle if it has to manage too many distinct TMs. Running with lots of slots per TM works better with RocksDB than with the heap-based state backends, because with RocksDB the state is off-heap. With state on the heap, running with lots of slots increases the likelihood of significant GC pauses.
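
To make the trade-off concrete, here is a minimal sketch of the arithmetic. It assumes default slot sharing, so the job needs as many slots as its highest operator parallelism. The class, the method names, and the ceiling of 100 TMs are hypothetical and only encode the rule of thumb above; nothing here is a Flink API or a limit that Flink enforces.

```java
// Hypothetical helper, not part of Flink: sizing arithmetic only.
class SlotPlanSketch {

    // Ceiling division for positive ints.
    static int ceilDiv(int a, int b) {
        return (a + b - 1) / b;
    }

    // Smallest number of slots per TM that keeps the TM count <= maxTaskManagers.
    // Assumes the job needs `parallelism` slots in total (default slot sharing).
    static int slotsPerTaskManager(int parallelism, int maxTaskManagers) {
        if (parallelism < 1 || maxTaskManagers < 1) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        return ceilDiv(parallelism, maxTaskManagers);
    }

    // Number of TMs needed to provide `parallelism` slots.
    static int taskManagersNeeded(int parallelism, int slotsPerTm) {
        return ceilDiv(parallelism, slotsPerTm);
    }

    public static void main(String[] args) {
        int maxTms = 100; // assumed ceiling, taken from the rule of thumb above
        for (int parallelism : new int[] {40, 400, 401}) {
            int slots = slotsPerTaskManager(parallelism, maxTms);
            int tms = taskManagersNeeded(parallelism, slots);
            System.out.println("parallelism=" + parallelism
                    + " -> " + slots + " slot(s) per TM, " + tms + " TM(s)");
        }
    }
}
```

With these inputs, a parallelism of 40 stays at one slot per TM, while 400 and 401 need 4 and 5 slots per TM respectively to stay at or below 100 TMs. Whether a heap-based backend tolerates that many slots per TM is something only a test with your own job can tell.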

Max Parallelism

The default is 128. Changing this parameter is painful, as it is baked into each checkpoint and savepoint. But making it larger than necessary comes with some cost (in memory/performance). Make it large enough that you will never have to change it, but no larger.
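
As an illustration of why this value matters, the sketch below shows how key groups are spread over parallel subtasks. Max parallelism is the number of key groups, and to my understanding Flink assigns key group k to subtask k * parallelism / maxParallelism (integer division). The class and method names are hypothetical; this reproduces only the arithmetic and omits the step that hashes each key into a key group.

```java
import java.util.Arrays;

// Hypothetical sketch of the key-group arithmetic only, not Flink code.
class KeyGroupSketch {

    // Index of the subtask that owns a given key group (integer division).
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    // Number of key groups owned by each subtask.
    static int[] keyGroupsPerSubtask(int maxParallelism, int parallelism) {
        if (parallelism < 1 || parallelism > maxParallelism) {
            throw new IllegalArgumentException(
                    "parallelism must be between 1 and maxParallelism");
        }
        int[] counts = new int[parallelism];
        for (int kg = 0; kg < maxParallelism; kg++) {
            counts[subtaskFor(kg, maxParallelism, parallelism)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 128 key groups over 3 subtasks: the split is close to even.
        System.out.println(Arrays.toString(keyGroupsPerSubtask(128, 3)));
        // 128 key groups over 128 subtasks: one key group each, no room to grow.
        System.out.println(Arrays.toString(keyGroupsPerSubtask(128, 128)));
    }
}
```

Since a key group cannot be split, the parallelism of a keyed operator can never exceed max parallelism, which is why the value should leave headroom for future scaling. In a real job it is set with `setMaxParallelism` on the execution environment or on an individual operator.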

Can we also say that the FS state backend works better than the heap-based state backends in this context? - ardhani, Jul 30 '20 at 12:24
The FS state backend is one of the heap-based state backends. - David Anderson, Jul 30 '20 at 12:38
