
I am currently working on a PySpark application that outputs daily delta extracts as parquet. Each output is to be a single partition (the natural partition is the date the data was created/updated, which is how the extracts are being built).

My plan was then to take the output parquet folder, rename the actual parquet part file itself, move it to another location, and discard the original *.parquet directory along with its _SUCCESS and *.crc files.

While I have tested reading files produced this way with both Spark and Pandas, I am unsure whether it will cause issues with other applications that we may introduce in the future.

Can anyone see any actual issue (apart from the processing/coding effort) with the above approach?
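For reference, the post-write cleanup step described above can be sketched with only the Python standard library (the directory layout matches Spark's single-partition output; the function and path names are hypothetical):

```python
import glob
import os
import shutil

def promote_part_file(output_dir: str, dest_path: str) -> str:
    """Move the single part-*.parquet file out of a Spark output
    directory to dest_path, then discard the directory (including
    its _SUCCESS and *.crc files)."""
    parts = glob.glob(os.path.join(output_dir, "part-*.parquet"))
    if len(parts) != 1:
        raise RuntimeError(f"expected exactly one part file, found {len(parts)}")
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    shutil.move(parts[0], dest_path)  # rename + relocate the parquet file
    shutil.rmtree(output_dir)         # drop _SUCCESS, .crc files, and the folder
    return dest_path
```

The single-file check guards against the directory unexpectedly containing more than one partition, in which case renaming one part file would silently drop data.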

Thanks

Ash

1 Answer


If you have a single parquet file and simply rename it, the renamed file is still a valid parquet file.

If you concatenate two or more parquet files into one, however, the combined file will not be a valid parquet file.
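As a quick structural sanity check after renaming (a sketch, not a full validation): every parquet file starts and ends with the 4-byte magic number `PAR1`. Note this only inspects the magic bytes, not the footer metadata, so it will not catch a naively concatenated file:

```python
import os

def looks_like_parquet(path: str) -> bool:
    """Check for the 4-byte PAR1 magic at both ends of the file.
    A rename/move never changes file contents, so a valid single
    parquet file still passes this check after being renamed."""
    # minimum valid size: head magic + footer length + tail magic
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # last 4 bytes of the file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```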

  • If you need to combine multiple parquet files into one, it is better to create a single file with Spark (using repartition) and write it out to the table.

    (or)

  • You can also use parquet-tools-**.jar to merge multiple parquet files into one parquet file.

notNull
  • In this instance, I am producing one file per day (per table), each with a unique name via a timestamp (i.e. `df.coalesce(1).write.format("parquet").mode("overwrite").save(delta_parquet_output)`). These would remain the raw source. If any process requires combining files, that would need to happen downstream, independent of this process. I want to ensure that other apps and processes can deal with files in this way and don't expect a parquet "directory/file" type structure. – Ash Oct 22 '19 at 03:40
  • @Ash, if you are planning to combine files in any process, then go with `csv/text files` so that there will be no issues when combining them into one file! – notNull Oct 22 '19 at 15:02
  • Yes, I understand. I also understand that to merge parquets together, they will need to be processed in some form or another (data frames, Athena, whatever) before they can be merged. Thanks for your responses. – Ash Oct 24 '19 at 23:02