
I am working in Databricks, where I have a DataFrame.

type(df) 
Out: pyspark.sql.dataframe.DataFrame

The only thing I want is to write this complete Spark DataFrame to Azure Blob Storage.

I found this post. So I tried that code:

# Configure blob storage account access key globally
spark.conf.set(
  "fs.azure.account.key.%s.blob.core.windows.net" % storage_name,
  sas_key)

output_container_path = "wasbs://%s@%s.blob.core.windows.net" % (output_container_name, storage_name)
output_blob_folder = "%s/wrangled_data_folder" % output_container_path

# write the dataframe as a single file to blob storage
(datafiles
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .format("com.databricks.spark.csv")
 .save(output_blob_folder))

Running that code leads to the error below. Changing the "csv" part to parquet and other formats fails as well.

org.apache.spark.sql.AnalysisException: CSV data source does not support struct<AccessoryMaterials:string,CommercialOptions:string,DocumentsUsed:array<string>,Enumerations:array<string>,EnvironmentMeasurements:string,Files:array<struct<Value:string,checksum:string,checksumType:string,name:string,size:string>>,GlobalProcesses:string,Printouts:array<string>,Repairs:string,SoftwareCapabilities:string,TestReports:string,endTimestamp:string,name:string,signature:string,signatureMeaning:bigint,startTimestamp:string,status:bigint,workplace:string> data type.;

Therefore my question (and my assumption is that this should be easy): how can I write my Spark DataFrame from Databricks to Azure Blob Storage?

My Azure folder structure is like this:

Account = MainStorage 
Container 1 is called "Data"   # contains all the data; irrelevant here because I have already read it in. 
Container 2 is called "Output" # here I want to store my Spark DataFrame. 

Many thanks in advance!

EDIT: I am using Python. However, I don't mind if the solution is in another language (as long as Databricks supports it, like R/Scala etc.). If it works, it is perfect :-)

R overflow
  • The error message is related to the data source (CSV does not support struct columns); writing to Azure Blob Storage is supported by all APIs, see also the documentation: https://learn.microsoft.com/de-de/azure/databricks/data/data-sources/azure/azure-storage Most users work with mounts, which are also described in the doc. – Hauke Mallow Mar 26 '20 at 14:43
  • @R overflow have you gone through the solution I shared with you? – venus Apr 02 '20 at 04:22

1 Answer


Assuming you have already mounted the blob storage, use the approach below to write your DataFrame in CSV format.
Please note that the newly created file will get a default file name with a csv extension, so you may need to rename it to get a consistent name.
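
If the storage account is not mounted yet, the mount step could look roughly like the sketch below (Python/dbutils, since the asker works in PySpark; ContainerName and StorageAccountName are the same placeholders as in the Scala snippet that follows, and sas_key is the access key variable already used in the question):

# Sketch only: mount the blob container so it is reachable under /mnt/ContainerName.
# ContainerName / StorageAccountName are placeholders (Output / MainStorage in the asker's case);
# sas_key is the access key variable from the question.
dbutils.fs.mount(
    source="wasbs://ContainerName@StorageAccountName.blob.core.windows.net",
    mount_point="/mnt/ContainerName",
    extra_configs={
        "fs.azure.account.key.StorageAccountName.blob.core.windows.net": sas_key
    }
)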

// output_container_path = wasbs://ContainerName@StorageAccountName.blob.core.windows.net/DirectoryName
val mount_root = "/mnt/ContainerName/DirectoryName"
df.coalesce(1)
  .write
  .format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save(s"dbfs:$mount_root/")
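
Since the asker is working in Python, a rough PySpark equivalent is sketched below (an assumption, not part of the original Scala snippet). Note that plain CSV still cannot hold the struct/array columns from the AnalysisException in the question, so the sketch first serializes those columns to JSON strings with to_json:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, MapType, StructType

mount_root = "/mnt/ContainerName/DirectoryName"  # same mount point as in the Scala snippet above

# CSV cannot store complex types, so turn struct/array/map columns into JSON strings first
complex_cols = [f.name for f in df.schema.fields
                if isinstance(f.dataType, (StructType, ArrayType, MapType))]
flat_df = df
for c in complex_cols:
    flat_df = flat_df.withColumn(c, F.to_json(F.col(c)))

# write a single CSV file (with header) into the mounted blob container
(flat_df
 .coalesce(1)
 .write
 .format("csv")
 .option("header", "true")
 .mode("overwrite")
 .save("dbfs:%s/" % mount_root))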
venus
  • I followed your example (the mounting part is done). However, I am facing: "error: not found: value df", while I do have a df (type(df) results in pyspark.sql.dataframe.DataFrame). Could it be the case that Python variables are not recognized in your code, since it is Scala? – R overflow Apr 02 '20 at 09:35
  • Yes, df is a DataFrame. In Databricks, if you have defined a variable as a Python variable, you may not be able to use that variable further in Scala code. – venus Apr 02 '20 at 10:05
  • I see! It works for Scala. Now checking it for Python; if you are interested (or if you know an answer :-)), I raised a question [here](https://stackoverflow.com/questions/60990525/write-a-pyspark-sql-dataframe-dataframe-without-losing-information). – R overflow Apr 02 '20 at 11:34