Currently, I'm using the azure-storage-blob and hadoop-azure packages to download files from Azure Blob Storage to the local file system.
...
String url = "https://blob_storage_url";
String filename = url.replaceFirst("https.*/", "");
// Setup the cloud storage account
String storageConnectionString = "...";
CloudStorageAccount account = CloudStorageAccount.parse(storageConnectionString);
// Create a blob service client
CloudBlobClient blobClient = account.createCloudBlobClient();
// Get a reference to a container
CloudBlobContainer container = blobClient.getContainerReference(containerName);
for (ListBlobItem blobItem : container.listBlobs(filename)) {
    // If the item is a blob, not a virtual directory
    if (blobItem instanceof CloudBlockBlob) {
        // Download the file
        CloudBlockBlob retrievedBlob = (CloudBlockBlob) blobItem;
        retrievedBlob.downloadToFile(filename);
    }
}
...
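As a side note on the snippet above, replaceFirst("https.*/", "") keeps only the last path segment of the URL, because the greedy .* matches everything up to the final slash. A tiny sketch with a made-up URL (the account, container, and path names are placeholders, not my real values):

```java
public class FilenameDemo {
    public static void main(String[] args) {
        // Hypothetical blob URL, used only to illustrate the regex
        String url = "https://myaccount.blob.core.windows.net/mycontainer/dir/file.xml";

        // Greedy .* consumes everything up to (and including) the last '/'
        String filename = url.replaceFirst("https.*/", "");

        System.out.println(filename); // prints "file.xml"
    }
}
```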
These downloaded files are actually XML files, and I then have to process the content of each one. To do this, I use the spark-xml_2.11 (com.databricks.spark.xml) package.
StructType schema = new StructType()
.add("attr1", DataTypes.StringType, false)
.add("attr2", DataTypes.IntegerType, false)
... other_structFields_or_structTypes;
Dataset<Row> dataset = sparkSession.read()
.format("com.databricks.spark.xml")
.schema(schema)
    .load(filename);
The load() method requires a path (data backed by a local or distributed file system). So, is there an option to load the files from Blob Storage directly?
I found this guide: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html. However, the first option, "Mount Azure Blob Storage containers to DBFS", requires a Databricks cluster.
With the second option, "Access Azure Blob Storage directly", I first tried setting up the account access key:
sparkSession.sparkContext().hadoopConfiguration().set(
"fs.azure.account.key.<my-storage-account-name>.blob.core.windows.net",
"<my-storage-account-access-key>"
);
StructType schema = new StructType()
.add("attr1", DataTypes.StringType, false)
.add("attr2", DataTypes.IntegerType, false)
... other_structFields_or_structTypes;
Dataset<Row> dataset = sparkSession.read()
.format("com.databricks.spark.xml")
.schema(schema)
    .load(filename); // I also tried with the full URL
But the following exception was raised:
"java.io.IOException: No FileSystem for scheme: https"
I also tried changing the protocol to wasbs, but again a similar exception was raised:
"java.io.IOException: No FileSystem for scheme: wasbs"
Any suggestions or comments would be appreciated.