I am trying to process a set of CSV files with PySpark.
I am working on AWS EMR (emr-5.27.0) with Spark 2.4.4.
I load the files with:
src_df = spark.read.csv("s3://my-bucket/extract/2019*/*csv.gz", header=True, inferSchema=True)
The problem is that the columns may vary slightly from one file to another, so the data ends up misaligned depending on which file a row comes from.
I thought specifying the "header" option would handle this, but it seems that only the header of the first file loaded is used.
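To illustrate, here is a small local reproduction of what I think is happening (the /tmp paths, column names and values below are made up for the example, not my real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two made-up extracts whose headers list the same columns in a different order.
spark.createDataFrame([(1, "alice", 10.0)], ["id", "name", "amount"]) \
    .write.csv("/tmp/extract/20190101", header=True, mode="overwrite")
spark.createDataFrame([(2, 20.0, "bob")], ["id", "amount", "name"]) \
    .write.csv("/tmp/extract/20190102", header=True, mode="overwrite")

# Reading both directories at once: the header of the first file is applied
# to every file, so the values from the second extract land under the wrong
# column names.
df = spark.read.csv("/tmp/extract/2019*/*.csv", header=True, inferSchema=True)
df.show()

In the output, the row from the second extract has its "name" and "amount" values swapped, which matches the misalignment I see with my real files.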
Any ideas?
Thanks in advance.