I want to read Kafka records serialized in Avro format using PySpark. I also fetch the Avro schema from the Confluent Schema Registry (as a schema string).
I'm able to read from Kafka (it returns the Kafka metadata columns such as key, value, topic, partition, offset, timestamp, and timestampType), but I want to deserialize the values and flatten them. How do I deserialize and flatten the value column into a PySpark DataFrame?
Here is the code:
df_kafka = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "bootstrap_servers") \
    .option("subscribePattern", "topic") \
    .option("kafka.security.protocol", "SSL") \
    .option("kafka.ssl.truststore.location", "cert.jks") \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .option("mode", "PERMISSIVE") \
    .load()