
I'm new to Azure and Python and was creating a notebook in Databricks to output the results of a piece of SQL. The code below produces the expected output, but with a default filename that's about 100 characters long. I'd like to be able to give the output a sensible name and add a date/time to create uniqueness, something like testfile20191001142340.csv. I've searched high and low and can't find anything that helps, hoping somebody in the community can point me in the right direction.

%python
try:
  # Replace with your SQL
  dfsql = spark.sql("select * from dbsmets1mig02_technical_build.tbl_Temp_Output_CS_Firmware_Final order by record1")
except Exception as e:
  print("Exception occurred:", e)
else:
  # dfsql only exists if the query succeeded, so the check belongs here
  if dfsql.count() == 0:
    print("No data rows")
  else:
    (dfsql.coalesce(1)
      .write.format("com.databricks.spark.csv")
      .option("header", "false")
      .option("delimiter", "|")
      .option("quote", "\u0000")
      .mode("overwrite")
      .save("/mnt/publisheddatasmets1mig/metering/smets1mig/cs/system_data_build/firmware/outbound/"))
RG0107
  • https://stackoverflow.com/questions/2158347/how-do-i-turn-a-python-datetime-into-a-string-with-readable-format-date? – 4c74356b41 Oct 01 '19 at 13:36

1 Answer


The issue with naming a single file is that it pretty much goes against the philosophy of Spark. To enable quick processing, Spark has to be able to parallelise writes. For Parquet files or other outputs that naturally support parallelism this isn't a problem. In the case of .csv files, though, we are used to working with single files, hence the confusion.

Long story short, if you did not use .coalesce(1), Spark would write your data as multiple .csv files in one folder. Since you coalesce to a single partition, there will be only one file - but with a generated name. So you have two options here:

  1. rename/move the file afterwards using databricks utils or regular python libraries (a sketch follows this list)
  2. .collect the result and save it using other libraries (the csv package would be the default; the toLocalIterator sketch further below uses it in a safer way)
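
A minimal sketch of option 1, assuming the mount path from your question; the "testfile" prefix and the %Y%m%d%H%M%S timestamp format are taken from your example name, and dbutils is available in Databricks notebooks without an import:

from datetime import datetime

out_dir = "/mnt/publisheddatasmets1mig/metering/smets1mig/cs/system_data_build/firmware/outbound/"
stamp = datetime.now().strftime("%Y%m%d%H%M%S")  # e.g. 20191001142340

# Because of coalesce(1), Spark wrote exactly one part-*.csv into out_dir;
# find it and rename it to the timestamped name.
part_file = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, out_dir + "testfile" + stamp + ".csv")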

The obvious question you may have is why it is so hard to do something as simple as saving to a single file - and the answer is that it's a hard problem for Spark. The issue with saving a single partition is that if you have more data than fits in your driver/executor memory, repartitioning to 1 partition or collecting the data to the driver is simply going to fail with an exception.

To safely save to a single .csv file you can use the toLocalIterator method, which loads only one partition into memory at a time; within its iterator, save your results to a single file using the csv package.
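
A sketch of that approach, assuming the dfsql DataFrame from your question and a hypothetical local path /tmp/testfile.csv (copy it to DBFS afterwards with dbutils.fs.cp("file:/tmp/testfile.csv", ...) if needed):

import csv

# Stream the DataFrame one partition at a time through the driver,
# appending rows to a single local .csv file with your | delimiter.
with open("/tmp/testfile.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    for row in dfsql.toLocalIterator():  # only one partition in driver memory at a time
        writer.writerow(row)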

Daniel