
I am working in Databricks, where I have a DataFrame.

type(df) 
Out: pyspark.sql.dataframe.DataFrame

The only thing I want is to write this complete Spark DataFrame to Azure Blob Storage.

I found this post. So I tried that code:

# Configure blob storage account access key globally
spark.conf.set(
  "fs.azure.account.key.%s.blob.core.windows.net" % storage_name,
  sas_key)

output_container_path = "wasbs://%s@%s.blob.core.windows.net" % (output_container_name, storage_name)
output_blob_folder = "%s/wrangled_data_folder" % output_container_path

# write the dataframe as a single file to blob storage
(datafiles
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .format("com.databricks.spark.csv")
 .save(output_blob_folder))

Running that code leads to the error below. Changing the "csv" part to parquet and other formats fails as well.

org.apache.spark.sql.AnalysisException: CSV data source does not support struct<AccessoryMaterials:string,CommercialOptions:string,DocumentsUsed:array<string>,Enumerations:array<string>,EnvironmentMeasurements:string,Files:array<struct<Value:string,checksum:string,checksumType:string,name:string,size:string>>,GlobalProcesses:string,Printouts:array<string>,Repairs:string,SoftwareCapabilities:string,TestReports:string,endTimestamp:string,name:string,signature:string,signatureMeaning:bigint,startTimestamp:string,status:bigint,workplace:string> data type.;

Therefore my question (and my assumption is that this should be easy): how can I write my Spark DataFrame from Databricks to Azure Blob Storage?

My Azure folder structure is like this:

Account = MainStorage 
Container 1 is called "Data"   # contains all the data; irrelevant here because I have already read it in. 
Container 2 is called "Output" # here I want to store my Spark DataFrame. 

Many thanks in advance!

EDIT: I am using Python. However, I don't mind if the solution is in another language (as long as Databricks supports it, like R/Scala etc.). If it works, it is perfect :-)

R overflow
  • The error message is related to the data source (CSV does not support struct columns); writing to Azure Blob Storage is supported by all APIs, see also the documentation: https://learn.microsoft.com/de-de/azure/databricks/data/data-sources/azure/azure-storage Most users work with mounts, which are also described in the doc. – Hauke Mallow Mar 26 '20 at 14:43
  • @R overflow have you gone through the solution I shared with you? – venus Apr 02 '20 at 04:22

1 Answer


Assuming you have already mounted the blob storage, use the approach below to write your DataFrame in CSV format.
Please note that the newly created file will get a default file name with a csv extension, so you may need to rename it to get a consistent name.
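
If the storage account is not mounted yet, the mount step could look roughly like the sketch below (Python/dbutils, since the asker works in PySpark; ContainerName and StorageAccountName are the same placeholders as in the Scala snippet that follows, and sas_key is the access key variable already used in the question):

# Sketch only: mount the blob container so it is reachable under /mnt/ContainerName.
# ContainerName / StorageAccountName are placeholders (Output / MainStorage in the asker's case);
# sas_key is the access key variable from the question.
dbutils.fs.mount(
    source="wasbs://ContainerName@StorageAccountName.blob.core.windows.net",
    mount_point="/mnt/ContainerName",
    extra_configs={
        "fs.azure.account.key.StorageAccountName.blob.core.windows.net": sas_key
    }
)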

// output_container_path = wasbs://ContainerName@StorageAccountName.blob.core.windows.net/DirectoryName
val mount_root = "/mnt/ContainerName/DirectoryName"
df.coalesce(1)
  .write
  .format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save(s"dbfs:$mount_root/")
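
Since the asker is working in Python, a rough PySpark equivalent is sketched below (an assumption, not part of the original Scala snippet). Note that plain CSV still cannot hold the struct/array columns from the AnalysisException in the question, so the sketch first serializes those columns to JSON strings with to_json:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, MapType, StructType

mount_root = "/mnt/ContainerName/DirectoryName"  # same mount point as in the Scala snippet above

# CSV cannot store complex types, so turn struct/array/map columns into JSON strings first
complex_cols = [f.name for f in df.schema.fields
                if isinstance(f.dataType, (StructType, ArrayType, MapType))]
flat_df = df
for c in complex_cols:
    flat_df = flat_df.withColumn(c, F.to_json(F.col(c)))

# write a single CSV file (with header) into the mounted blob container
(flat_df
 .coalesce(1)
 .write
 .format("csv")
 .option("header", "true")
 .mode("overwrite")
 .save("dbfs:%s/" % mount_root))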
venus
  • I followed your example (the mounting part is done). However, I am facing: "error: not found: value df", while I do have a df (type(df) results in pyspark.sql.dataframe.DataFrame). Could it be the case that Python variables are not recognized in your code, since it is Scala? – R overflow Apr 02 '20 at 09:35
  • Yes, df is a DataFrame. In Databricks, if you have defined a variable as a Python variable, you may not be able to use that variable further in Scala code. – venus Apr 02 '20 at 10:05
  • I see! It works for Scala. Now checking it for Python; if you are interested (or if you know an answer :-)), I raised a question [here](https://stackoverflow.com/questions/60990525/write-a-pyspark-sql-dataframe-dataframe-without-losing-information). – R overflow Apr 02 '20 at 11:34