
I want to infer a safe schema from JSON data coming from Kafka.

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input")
      .option("startingOffsets", "latest")  # the Kafka source ignores auto.offset.reset; use startingOffsets
      .load())
jsonDF = df.selectExpr("CAST(value AS STRING) AS jsonData")

There are a couple of solutions on stackoverflow:

  1. link suggests saving a small batch to a file, inferring the schema from it, and then using that schema for the streaming DataFrame, although the question is a year old.
  2. link uses schema_of_json and lit, although I'm not able to get it to work with streaming DataFrames.

I know schema inference can be dangerous, but the Spark application I'm developing has multiple sources with varied schemas (many columns). Is there a way to build a schema from the columns in the JSON data and force-cast them all to String to prevent data loss?

firecast
