
I want to use Spark Structured Streaming to read from a secure Kafka cluster. This means that I will need to force a specific group.id. However, as stated in the documentation, this is not possible. Still, the Databricks documentation, https://docs.azuredatabricks.net/spark/latest/structured-streaming/kafka.html#using-ssl, says that it is possible. Does this only refer to the Azure cluster?

Also, by looking at the documentation in the master branch of the apache/spark repo, https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md, we can see that such functionality is intended to be added in later Spark releases. Do you know of any plans for a stable release that will allow setting that consumer group.id?

If not, are there any workarounds for Spark 2.4.0 to be able to set a specific consumer group.id?

Panagiotis Fytas

4 Answers


Currently (v2.4.0) it is not possible.

You can check the following lines in the Apache Spark project:

https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L81 - generates the group.id

https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L534 - sets it in the properties that are used to create the KafkaConsumer
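
For reference, here is a paraphrased sketch (not the verbatim source) of what v2.4.0 does internally at the lines linked above:

import java.util.UUID

// Paraphrased from KafkaSourceProvider in v2.4.0: Spark generates a
// unique group.id per query and overwrites anything the user passes in.
// metadataPath is the query's checkpoint/metadata path.
def uniqueGroupId(metadataPath: String): String =
  s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"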

In the master branch you can find a modification that enables setting a prefix or a particular group.id:

https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L83 - generates the group.id based on a group prefix (groupidprefix)

https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L543 - sets the previously generated groupId if kafka.group.id wasn't passed in the properties
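
As a rough sketch of how the master-branch change is meant to be used (assuming the option is exposed as groupIdPrefix, per the linked code), every generated group.id would then start with your prefix:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topic1")
  // generated consumer group ids start with this prefix instead of
  // the default "spark-kafka-source"
  .option("groupIdPrefix", "my-prefix")
  .load()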

Bartosz Wardziński
  • Thanks for the response. Any idea how I would go about implementing those modified classes? Is building a jar out of the package and adding that jar with spark-submit enough? – Panagiotis Fytas Mar 26 '19 at 13:41
  • @PanagiotisFytas, you can check the code in the master branch of Apache Spark. I think it is enough to remove the following line (https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L534), build the jar, add it to spark-submit, and pass the `kafka.group.id` property via `option` – Bartosz Wardziński Mar 26 '19 at 15:19
  • I made this work by adding some commits from the master branch to the 2.4.0 branch. You can check my fork: https://github.com/PanagiotisFytas/spark. You build the fat jar with the custom connector by calling [external/kafka-0-10-sql/mvn -DskipTests package]. I am not sure how safe that method is, as I have not yet fully tested it. – Panagiotis Fytas Mar 27 '19 at 11:01
  • @PanagiotisFytas, I would be very careful with it. It is certainly not a recommended approach in production. However, it only modifies the Kafka streaming part. – Bartosz Wardziński Mar 27 '19 at 12:13

Since Spark 3.0.0

According to the Structured Kafka Integration Guide, you can provide the ConsumerGroup as the option kafka.group.id:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("kafka.group.id", "myConsumerGroup")
  .load()

However, Spark will not commit any offsets back, so the offsets of your ConsumerGroup will not be stored in Kafka's internal topic __consumer_offsets, but rather in Spark's checkpoint files.
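
Because the offsets live in the checkpoint rather than in Kafka, the query needs a stable checkpoint location to resume where it left off. A minimal sketch (the sink and path below are placeholders for illustration):

val query = df
  .writeStream
  .format("console")
  // offsets and progress are stored here, not in __consumer_offsets
  .option("checkpointLocation", "/tmp/checkpoints/myQuery")
  .start()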

Being able to set the group.id is meant to deal with Kafka's latest feature, Authorization using Role-Based Access Control, for which your ConsumerGroup usually needs to follow naming conventions.
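
For a secured cluster like the one in the question, a hedged sketch could look as follows; the group name, broker address, and credentials are assumptions for illustration, while the kafka.*-prefixed options are standard Kafka consumer configs that Spark passes through to the client:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9093")
  .option("subscribe", "topic1")
  // standard Kafka client security configs, passed through via the kafka. prefix
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config",
    """org.apache.kafka.common.security.plain.PlainLoginModule required username="user" password="pass";""")
  // hypothetical group name following an RBAC naming convention
  .option("kafka.group.id", "myteam.myapp.consumer")
  .load()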

A full example of a Spark 3.x application setting kafka.group.id is discussed and solved here.

Michael Heil

Now, with Spark 3.0, you can specify a group.id for Kafka: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations

Learnis

The Structured Streaming guide seems to be quite explicit about it:

Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:

group.id: Kafka source will create a unique group id for each query automatically.

auto.offset.reset: Set the source option startingOffsets to specify where to start instead.
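
For example, instead of setting auto.offset.reset, a minimal sketch would use the startingOffsets source option (broker and topic names are placeholders):

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topic1")
  // replaces auto.offset.reset; accepts "earliest", "latest",
  // or a JSON per-partition offset specification
  .option("startingOffsets", "earliest")
  .load()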

Jacek Laskowski