The scenario is: EventHub -> Azure Databricks (using pyspark)
File format: CSV (Quoted, Pipe delimited and custom schema )
I am trying to read CSV strings comming from eventhub. Spark is successfully creating the dataframe with the proper schema, but the dataframe end up empty after every message.
I managed to do some tests outside streaming environment, and when getting the data from a file, all goes well, but it fails when the data comes from a string.
So I found some links to help me on this, but none worked:
can-i-read-a-csv-represented-as-a-string-into-apache-spark-using-spark-csv?rq=1
Pyspark - converting json string to DataFrame
Right now I have the code below:
schema = StructType([StructField("Decisao",StringType(),True), StructField("PedidoID",StringType(),True), StructField("De_LastUpdated",StringType(),True)])
body = 'DECISAO|PEDIDOID|DE_LASTUPDATED\r\n"asdasdas"|"1015905177"|"sdfgsfgd"'
csvData = sc.parallelize([body])
df = spark.read \
.option("header", "true") \
.option("mode","FAILFAST") \
.option("delimiter","|") \
.schema(schema) \
.csv(csvData)
df.show()
Is that even possible to do with CSV files?