I am trying to process a set of CSVs with pyspark.

I am working on AWS EMR emr-5.27.0, with Spark 2.4.4.

I try to load the files with:

    src_df = spark.read.csv("s3://my-bucket/extract/2019*/*csv.gz", header=True, inferSchema=True)

The problem is that the columns may vary a little from one file to another, so the data ends up misaligned depending on the row.

I thought specifying the "header" option would fix this, but it seems that only the schema of the first file loaded is used.

Any idea?

Thanks in advance.

  • Please check https://stackoverflow.com/questions/37639956/how-to-import-multiple-csv-files-in-a-single-load – dassum Dec 08 '19 at 19:23
  • Thank you @dassum, but it does not help unfortunately. The question you refer to is about loading multiple files. I can load all the files with no issue; my problem is regarding the columns that change from one file to another. – sebastienpauset Dec 08 '19 at 19:32
  • When you read CSV files like that, the schema will be inferred from a single file, like you experienced. Even if you were to know the union of the schemas of the different files beforehand, if the order of the columns isn’t preserved, data from different columns will end up in one single column. There’s currently no way to infer the unified schema. – Oliver W. Dec 09 '19 at 13:05
  • Thank you for your feedback @OliverW. I hoped I would find a way to infer the schema on a per-file basis. I will handle it another way (a rough sketch of a per-file approach is below). – sebastienpauset Dec 09 '19 at 17:22
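
Since the comments point toward handling the schema per file, here is a minimal sketch of what that could look like on Spark 2.4. It assumes the files can be enumerated with boto3 (the bucket my-bucket and prefix extract/ are placeholders taken from the path in the question) and that the spark session from the question is available. Each file is read with its own header, missing columns are filled with nulls, and everything is unioned in a fixed column order:

    # Sketch: read each file separately so that file's own header drives its
    # columns, then align the columns by name before unioning.
    import boto3
    from functools import reduce
    from pyspark.sql import functions as F

    # List the input objects (bucket/prefix are placeholders)
    s3 = boto3.client("s3")
    paths = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket="my-bucket", Prefix="extract/"):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("csv.gz"):
                paths.append("s3://my-bucket/" + obj["Key"])

    # One DataFrame per file, each with its own header-derived columns
    dfs = [spark.read.csv(p, header=True, inferSchema=True) for p in paths]

    # Union of all column names, in a stable order
    all_cols = []
    for df in dfs:
        for c in df.columns:
            if c not in all_cols:
                all_cols.append(c)

    # Add missing columns as nulls and select in the same order, so the
    # positional union below lines the data up by column name
    aligned = [
        df.select([F.col(c) if c in df.columns else F.lit(None).alias(c) for c in all_cols])
        for df in dfs
    ]

    src_df = reduce(lambda a, b: a.union(b), aligned)

One caveat: with inferSchema=True the same column can come back with different types in different files, so the final union may still need explicit casts to a common type (or an explicit schema passed to each read).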

0 Answers