How to read a CSV with an Avro schema object as header in PySpark?

Question

I have a file that I can correctly read this way:

sqlContext.read.format('csv').options(header='false', inferSchema='true', delimiter = "\a", nullValue = '\\N').load('adl://resource.azuredatalakestore.net/datalake-prod/raw/something/data/something/date_part={}/{}'.format(elem[0], elem[1]))

The problem is that the file has no header row. The column names are actually in a separate file of type avsc, an Apache Avro schema object.

What's the best way to use it as the header of my DataFrame?

I'm running pyspark on Azure Databricks.

It might be worth finding out how the "raw data" is loaded... If you get an AVSC, then did you have actual Avro at one point? If so, why was it converted to CSV? — OneCricketeer, May 24 '19 at 19:59

simon_dmorias · Answer 1 · 2019-05-24T19:46:39.420


Do you also have an Avro file? The Databricks documentation has an example of reading Avro data with a schema parsed from an avsc file (https://docs.databricks.com/spark/latest/data-sources/read-avro.html). So you could parse the avsc file first and load a DataFrame that carries its schema:

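import java.io.File
import org.apache.avro.Schema

// Parse the Avro schema (JSON) from the .avsc file
val schema = new Schema.Parser().parse(new File("user.avsc"))

// Load Avro data with that schema; the resulting DataFrame exposes it as df.schema
val df = spark
  .read
  .format("avro")
  .option("avroSchema", schema.toString)
  .load("/tmp/episodes.avro")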

If you do not have an Avro file to go with it, you could try pointing that code at an empty folder, so that you get an empty DataFrame that still carries the schema.

Then use the schema on the csv file:

val csvDf = spark.read.format("csv").schema(df.schema).load(csvFilePath)
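If only the column names are needed and the types can still be inferred, a simpler route in PySpark may be to read the field names straight out of the avsc file, since an avsc file is plain JSON. The following is a minimal sketch under that assumption; the helper name `avsc_field_names` and the example schema are hypothetical, not taken from the question.

```python
import json


def avsc_field_names(avsc_text):
    """Return the top-level field names of an Avro record schema, in order."""
    schema = json.loads(avsc_text)
    if schema.get("type") != "record":
        raise ValueError("expected an Avro record schema")
    return [field["name"] for field in schema["fields"]]


# Hypothetical schema, for illustration only.
example_avsc = """
{"type": "record", "name": "Something",
 "fields": [{"name": "id", "type": "long"},
            {"name": "label", "type": ["null", "string"]}]}
"""

names = avsc_field_names(example_avsc)
print(names)
```

With the DataFrame loaded as in the question, `df.toDF(*names)` would then rename the columns positionally. This only works if the number of names matches the number of columns in the CSV.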

edited May 24 '19 at 19:46

answered May 24 '19 at 19:19

simon_dmorias


Note: A `.avsc` file can't be read using `load(avrofilePath)` – OneCricketeer May 24 '19 at 19:38
Thank you for the idea of pointing to an empty path. I also saw that example. My issue is that I cannot translate that code to PySpark. I even imported the spark-avro dependency, but it says it doesn't exist as a module. – dierre May 25 '19 at 07:15
For PySpark try the solution here: https://stackoverflow.com/questions/54693110/pyspark-2-4-0-read-avro-from-kafka-with-read-stream-python. Worst case read the avsc json into a df and build a schema by iterating on it. – simon_dmorias May 26 '19 at 09:15
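The "worst case" route from the last comment (read the avsc JSON and build a schema by iterating over it) can be sketched in plain Python without the spark-avro dependency. The sketch below turns the avsc fields into a DDL-style schema string such as `col0 INT, col1 DOUBLE`, which `DataFrameReader.schema` accepts in Spark versions that support DDL strings. The type mapping, the fallback to STRING for complex or logical types, and the names `avsc_to_ddl` and `spark_type` are assumptions made for illustration.

```python
import json

# Avro primitive type -> Spark SQL DDL type.
AVRO_TO_SPARK = {
    "string": "STRING",
    "int": "INT",
    "long": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
    "bytes": "BINARY",
}


def spark_type(avro_type):
    """Map one Avro field type to a Spark DDL type name."""
    # A union such as ["null", "string"]: use the first non-null branch.
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        avro_type = non_null[0] if non_null else "string"
    # Assumption: complex or logical types (records, arrays, ...) are read as STRING.
    if not isinstance(avro_type, str):
        return "STRING"
    return AVRO_TO_SPARK.get(avro_type, "STRING")


def avsc_to_ddl(avsc_text):
    """Build a DDL schema string from the top-level fields of an avsc document."""
    schema = json.loads(avsc_text)
    return ", ".join(
        "`{}` {}".format(field["name"], spark_type(field["type"]))
        for field in schema["fields"]
    )


# Hypothetical schema, for illustration only.
example_avsc = """
{"type": "record", "name": "Something",
 "fields": [{"name": "id", "type": "long"},
            {"name": "label", "type": ["null", "string"]},
            {"name": "score", "type": "double"}]}
"""

ddl = avsc_to_ddl(example_avsc)
print(ddl)
```

The resulting string could then replace `inferSchema` in the question's reader, for example `sqlContext.read.format('csv').schema(ddl).options(header='false', delimiter="\a", nullValue='\\N').load(path)`, where `path` stands for the question's data location. One way to get the avsc text from the data lake might be `spark.sparkContext.wholeTextFiles(avsc_path).first()[1]`, with `avsc_path` being a placeholder for wherever the schema file lives.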
