
Given that the data from both topics is eventually joined and sent to a Kafka sink, which is the better way to read from multiple topics?

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("subscribe", "t1,t2")   // one stream subscribed to both topics
  .load()

vs

val df1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("subscribe", "t1")
  .load()

val df2 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("subscribe", "t2")
  .load()

Somewhere I will df1.join(df2) and send the result to a Kafka sink.
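For context, the join-and-sink step I have in mind looks roughly like this (a sketch; the id key column, the output topic name, and the checkpoint path are placeholders, not actual names from my job):

val joined = df1
  .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS v1")
  .join(
    df2.selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS v2"),
    "id")

// Note: a stream-stream inner join without watermarks keeps unbounded state.
joined
  .selectExpr("id AS key", "concat(v1, v2) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("topic", "output")
  .option("checkpointLocation", "/tmp/checkpoints/join-sink")
  .start()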

Which would be the better option here in terms of performance and resource usage?

Thanks in advance

PS: I see another similar question, Spark structured streaming app reading from multiple Kafka topics, but there the dataframes from the two topics don't seem to be used together.

Vindhya G

1 Answer


With the first approach, you'd have to add a filter at some point to split the two topics apart again, and then proceed with the join. Unless you also want to process both streams together, the second approach is a little more performant and easier to maintain.
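To illustrate, with the first approach the combined stream carries Kafka's built-in topic column, so you'd split it back apart roughly like this (a sketch, reusing the servers variable from the question):

import org.apache.spark.sql.functions.col

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("subscribe", "t1,t2")
  .load()

// Each record from the Kafka source includes a `topic` column.
val t1 = df.filter(col("topic") === "t1")
val t2 = df.filter(col("topic") === "t2")
// ...then join t1 with t2 as planned.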

I'd say approach 2 is the straightforward one: it skips the filter stage, and is hence a little more efficient. It also gives the two streams autonomy from an infrastructure point of view, for example if one of the topics were to move to a new Kafka cluster. You also don't have to account for unevenness between the two topics, for example in their number of partitions, which makes job tuning easier.

D3V
  • Thanks @D3V. I think the last point would be the major plus for us, as the 2 topics differ widely in traffic. – Vindhya G Apr 14 '20 at 04:03