
I'm still struggling to understand how Flink exchanges/transfers data between different operators, and what happens to the actual data between the operators.

Take the example DAG above: [image: DAG of execution]

  1. The DataSet is distributed to all parallel instances of the GroupReduce operator, and the data is reduced according to the GroupReduce transformation.

  2. All of the new data is forwarded to the Filter->Map->Map operator, i.e. all data produced by one parallel instance of the GroupReduce operator is transferred to exactly one instance of the Filter->Map->Map operator (without the need for serialization/deserialization, so the operator accesses the data generated by the GroupReduce operator directly).

  3. All of the GroupReduce's output data is hashed and evenly distributed among all parallel instances of the (Filter->Map) operator (serialization/deserialization is needed between the operators).

So if, for example, the GroupReduce operator's output is about 100 MB, it forwards 100 MB to the (Filter->Map->Map) operator, and additionally hashes a copy of that 100 MB and transfers it to the (Filter->Map) instances, generating another 100 MB of network traffic.

I'm quite confused about why there is so much network traffic after the GroupReduce and before the filter steps. Wouldn't it be better to chain the GroupReduce and Filter steps together before sending the now-reduced data to subsequent operators?
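For intuition, the hash-partitioned exchange in step 3 can be sketched in plain Python (this is a conceptual simulation, not Flink API code; the records, keys, and `parallelism` value are invented for illustration):

```python
# Conceptual sketch of a hash-partitioned exchange between two operators.
# Not Flink API code: records and parallelism are made up for illustration.

def hash_partition(records, parallelism):
    """Route each (key, value) record to one downstream instance by key hash."""
    partitions = [[] for _ in range(parallelism)]
    for key, value in records:
        partitions[hash(key) % parallelism].append((key, value))
    return partitions

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
partitions = hash_partition(records, parallelism=3)
# All records with the same key land in the same partition, and every
# record crosses the network exactly once (serialized on the way).
```

The key point is that every output record of the upstream operator is sent over the network once per hash-partitioned downstream consumer, which is why the traffic doubles in the example above.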

tooobsias
  • Flink cannot chain the `GroupReduce` and `Filter` operators for you because you have two different filters after the `GroupReduce`. Besides that, the second filter comes after a hash partition, so I assume you are using some `keyBy` operator. This is the same as the shuffle phase in Spark, where all instances of the upstream operator send data to all instances of the downstream operator. The shuffle phase has a lot of traffic. – Felipe Nov 10 '20 at 11:28
  • The keyBy operator comes after the (Filter->Map) operator, so I'm wondering why Flink hashes the dataset before the (Filter->Map) and not afterwards, when a lot of the data has been filtered out. – tooobsias Nov 10 '20 at 11:46

1 Answer


The GroupReduce function is analogous to using a combiner from the MapReduce programming model.

> Partial computation can significantly improve the performance of a GroupReduceFunction. This technique is also known as applying a Combiner. Implement the GroupCombineFunction interface to enable partial computations, i.e., a combiner for this GroupReduceFunction.

So, after a combiner there is always a shuffle phase/partition that connects all upstream operators to all downstream operators. Check this answer to clarify what a combiner is.
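The effect of a combiner can be sketched in plain Python (a simulation of the idea, not the Flink `GroupCombineFunction` API; the instance data and word-count-style aggregation are invented for illustration):

```python
# Simulation of combine-before-shuffle: each upstream instance pre-aggregates
# its local records, so fewer records have to cross the network in the shuffle.
# Plain Python illustration, not Flink API code.

def combine_locally(local_records):
    """Partial aggregation on one upstream instance (the 'combiner' step)."""
    partial = {}
    for key, value in local_records:
        partial[key] = partial.get(key, 0) + value
    return list(partial.items())

def shuffle_and_reduce(per_instance_outputs):
    """Merge the combined records from all instances into the final result."""
    final = {}
    for records in per_instance_outputs:
        for key, value in records:
            final[key] = final.get(key, 0) + value
    return final

instance_1 = [("a", 1), ("a", 1), ("b", 1)]   # 3 records held locally
instance_2 = [("a", 1), ("b", 1), ("b", 1)]   # 3 records held locally

combined = [combine_locally(r) for r in (instance_1, instance_2)]
shuffled = sum(len(c) for c in combined)      # 4 records cross the network, not 6
result = shuffle_and_reduce(combined)         # {"a": 3, "b": 3}
```

The shuffle itself cannot be avoided, but combining first shrinks what travels over it.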

Felipe