
Currently, I'm using the azure-storage-blob and hadoop-azure packages to download files from Blob Storage to the local filesystem.

...
String url = "https://blob_storage_url";

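// Keep only the blob name (the path relative to the container)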
String filename = url.replaceFirst("https.*/", "");

// Setup the cloud storage account
String storageConnectionString = "...";
CloudStorageAccount account = CloudStorageAccount.parse(storageConnectionString);

// Create a blob service client
CloudBlobClient blobClient = account.createCloudBlobClient();

// Get a reference to a container
CloudBlobContainer container = blobClient.getContainerReference(containerName);

for (ListBlobItem blobItem : container.listBlobs(filename)) {
    // If the item is a blob, not a virtual directory
    if (blobItem instanceof CloudBlockBlob) {
        // Download the file
        CloudBlockBlob retrievedBlob = (CloudBlockBlob) blobItem;
        retrievedBlob.downloadToFile(filename);
    }
}
...

These downloaded files are actually XML files. Then I have to process the content of each one. To do this, I use the spark-xml_2.11 (com.databricks.spark.xml) package.

StructType schema = new StructType()
    .add("attr1", DataTypes.StringType, false)
    .add("attr2", DataTypes.IntegerType, false)
    ... other_structFields_or_structTypes;

Dataset<Row> dataset = sparkSession.read()
    .format("com.databricks.spark.xml")
    .schema(schema)
    .load(filename);

The load() method requires a path to data backed by a local or distributed file system. So, is there a way to load the files directly from Blob Storage?

I found this guide: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html. However, the first option, Mount Azure Blob Storage containers to DBFS, requires a Databricks cluster.

With the second option, Access Azure Blob Storage directly, I first tried setting the account access key:

sparkSession.sparkContext().hadoopConfiguration().set(
    "fs.azure.account.key.<my-storage-account-name>.blob.core.windows.net",
    "<my-storage-account-access-key>"
);

StructType schema = new StructType()
    .add("attr1", DataTypes.StringType, false)
    .add("attr2", DataTypes.IntegerType, false)
    ... other_structFields_or_structTypes;

Dataset<Row> dataset = sparkSession.read()
    .format("com.databricks.spark.xml")
    .schema(schema)
    .load(filename); // I also tried with the full URL

But the following exception was raised:

"java.io.IOException: No FileSystem for scheme: https". 

Also, I tried changing the protocol to wasbs, but a similar exception was raised:

"java.io.IOException: No FileSystem for scheme: wasbs".

Any suggestions or comments would be appreciated.

  • Add the dependency jar to access the Azure Blob filesystem. – Mahesh Gupta Sep 23 '19 at 10:49
  • The project is managed with Maven and all dependencies are included/imported. – JRH Sep 23 '19 at 15:14
  • So from where are you accessing your Blob Storage, locally or somewhere else? – Mahesh Gupta Sep 24 '19 at 05:33
  • @MaheshGupta From local. – JRH Sep 25 '19 at 07:03
  • So you need to replace all the jars (spark_home/jars) with the jars newly generated by Maven and then try to run it. – Mahesh Gupta Sep 25 '19 at 07:22
  • I'm sorry, but I didn't understand you, or maybe I explained it wrong before. No Spark installation/deployment exists on my computer. It's a Java project with all Maven dependencies (Spark, Azure, Hadoop...) included in pom.xml, so I run it locally. The Spark job starts/runs locally, accesses Blob Storage, gets the files and finally reads their content (XML). – JRH Sep 26 '19 at 08:18
