I am trying to process a set of CSVs with pyspark.

I am working on AWS EMR emr-5.27.0, with Spark 2.4.4.

I try to load the files with:

    src_df = spark.read.csv("s3://my-bucket/extract/2019*/*csv.gz", header=True, inferSchema=True)

The problem is that the columns may vary a little from one file to another, so the data ends up misaligned depending on the row.

I thought specifying the "header" option would fix this, but it seems that only the schema of the first file loaded is used.

Any idea?

Thanks in advance.

  • Please check https://stackoverflow.com/questions/37639956/how-to-import-multiple-csv-files-in-a-single-load – dassum Dec 08 '19 at 19:23
  • Thank you @dassum, but it does not help unfortunately. The question you refer to is about loading multiple files. I can load all the files with no issue; my problem is regarding the columns that change from one file to another. – sebastienpauset Dec 08 '19 at 19:32
  • When you read CSV files like that, the schema will be inferred from a single file, like you experienced. Even if you were to know the union of the schemas of the different files beforehand, if the order of the columns isn’t preserved, data from different columns will end up in one single column. There’s currently no way to infer the unified schema. – Oliver W. Dec 09 '19 at 13:05
  • Thank you for your feedback @OliverW. I hoped I would find a way to infer the schema on a per-file basis. I will handle it another way (a rough sketch of a per-file approach is below). – sebastienpauset Dec 09 '19 at 17:22
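
Since the comments point toward handling the schema per file, here is a minimal sketch of what that could look like on Spark 2.4. It assumes the files can be enumerated with boto3 (the bucket my-bucket and prefix extract/ are placeholders taken from the path in the question) and that the spark session from the question is available. Each file is read with its own header, missing columns are filled with nulls, and everything is unioned in a fixed column order:

    # Sketch: read each file separately so that file's own header drives its
    # columns, then align the columns by name before unioning.
    import boto3
    from functools import reduce
    from pyspark.sql import functions as F

    # List the input objects (bucket/prefix are placeholders)
    s3 = boto3.client("s3")
    paths = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket="my-bucket", Prefix="extract/"):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("csv.gz"):
                paths.append("s3://my-bucket/" + obj["Key"])

    # One DataFrame per file, each with its own header-derived columns
    dfs = [spark.read.csv(p, header=True, inferSchema=True) for p in paths]

    # Union of all column names, in a stable order
    all_cols = []
    for df in dfs:
        for c in df.columns:
            if c not in all_cols:
                all_cols.append(c)

    # Add missing columns as nulls and select in the same order, so the
    # positional union below lines the data up by column name
    aligned = [
        df.select([F.col(c) if c in df.columns else F.lit(None).alias(c) for c in all_cols])
        for df in dfs
    ]

    src_df = reduce(lambda a, b: a.union(b), aligned)

One caveat: with inferSchema=True the same column can come back with different types in different files, so the final union may still need explicit casts to a common type (or an explicit schema passed to each read).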

0 Answers