
Our project stores XML files in Azure Blob Storage. In the backend we need to analyze those XML files, select the ones whose content matches a filter, and return the URLs of the matching files.

I have no idea how to achieve this. Could you help me if you have any ideas? Thank you very much.

marc_s
tiefu cai
  • You will need some form of an index for your files and their metadata. This is one of the big advantages to using a document based service like CosmosDB. I see a similar question here, and the answers may be helpful: https://stackoverflow.com/questions/14440506/how-to-query-cloud-blobs-on-windows-azure-storage – Mike Oryszak Mar 22 '19 at 20:05
  • You could use the Azure Data Lake Storage Gen2 APIs (https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) to analyze the blobs in Azure Blob Storage with an analytics engine such as Hadoop or Spark, provided as part of HDInsight. As part of your analytics job, you would filter the XML files based on their content and write the matching URLs to another blob, an Azure table, or Cosmos DB. – Vamshi Mar 22 '19 at 21:34

1 Answer


I created a simple sample that reads the XML files stored in Azure Blob Storage, parses them, and filters them by a condition to output a list of blob URLs. The sample uses Azure Storage SDK v8.0.0 for Java and jsoup, an HTML parser for Java.

Here are the dependencies of my Maven project.

<!-- https://mvnrepository.com/artifact/com.microsoft.azure/azure-storage -->
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>azure-storage</artifactId>
    <version>8.0.0</version>
</dependency>
<dependency>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>

The XML content I used in my project looks like the following. There are 6 such files for testing.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE person SYSTEM "person.dtd">
<person>
    <name>Peter Pan</name>
    <gender>Male</gender>
    <age>30</age>
</person>

The code is below.

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;
import java.security.InvalidKeyException;
import java.sql.Date;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.StorageException;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;
import com.microsoft.azure.storage.blob.ListBlobItem;
import com.microsoft.azure.storage.blob.SharedAccessBlobPermissions;
import com.microsoft.azure.storage.blob.SharedAccessBlobPolicy;

public class FilterXMLFiles {

    private static final String storageConnectionString = "<your storage account connection string>";
    private static final String containerName = "xmls"; // It's my container to store these XML files.

    private static CloudBlobClient serviceClient;

    public static void main(String[] args) throws InvalidKeyException, URISyntaxException, StorageException, MalformedURLException, IOException {
        CloudStorageAccount account = CloudStorageAccount.parse(storageConnectionString);
        serviceClient = account.createCloudBlobClient();
        CloudBlobContainer container = serviceClient.getContainerReference(containerName);
        // Generate a read-only SAS token for the XML files in the container
        SharedAccessBlobPolicy policy = new SharedAccessBlobPolicy();
        policy.setPermissions(EnumSet.of(SharedAccessBlobPermissions.READ));
        // Start a day early to tolerate clock skew, and keep the token short-lived
        policy.setSharedAccessStartTime(Date.valueOf(LocalDate.now().minusDays(1)));
        policy.setSharedAccessExpiryTime(Date.valueOf(LocalDate.now().plusDays(1)));
        String token = container.generateSharedAccessSignature(policy, null);
        // Get the list of blobs in the container.
        Iterator<ListBlobItem> blobs = container.listBlobs().iterator();
        // Create a List object to store these filtered urls.
        List<String> blobUrls = new ArrayList<>();
        while(blobs.hasNext()) {
            // Get the blob url with SAS token
            String uri = blobs.next().getUri().toString();
            String urlWithSAS = String.format("%s?%s",uri, token);
            // System.out.println(urlWithSAS);
            // Parse and filter by jsoup with the condition age >= 30
            Document root = Jsoup.parse(new URL(urlWithSAS), 30*1000);
            int age = Integer.parseInt(root.selectFirst("age").text());
            if(age >= 30) { // It's the condition age >=30
                blobUrls.add(uri);
            //  blobUrls.add(urlWithSAS);
            }
        }
        System.out.println(String.join("\n", blobUrls));
    }

}

The result looks like this:

https://<my account name>.blob.core.windows.net/xmls/p1.xml
https://<my account name>.blob.core.windows.net/xmls/p3.xml
https://<my account name>.blob.core.windows.net/xmls/p5.xml

The sample is kept simple to explain the idea. Of course, in a real application where the filter needs to be flexible, I think a better solution is to express the condition as an XQuery expression, much like a SQL query. For example, you could use Saxon (a third-party Java library) instead of jsoup and filter with an XQuery expression. For more details about XQuery, you can refer to the XQuery Tutorial and the Saxon documentation.
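As a rough sketch of passing the filter in as an expression string, the following uses only the JDK. Note that it swaps in the built-in javax.xml.xpath API (XPath 1.0) as a stand-in and is not Saxon or XQuery. The class name XPathFilter and the helper matches are hypothetical names for illustration, and the XML is inlined as a string here instead of being downloaded from a blob.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathFilter {

    // Hypothetical helper: returns true when the XML text satisfies the XPath condition.
    static boolean matches(String xml, String condition) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The sample files declare an external DTD (person.dtd). Resolve it to
        // empty content so the parser does not try to load it.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        XPathExpression expr = XPathFactory.newInstance().newXPath().compile(condition);
        // Evaluate the condition as a boolean against the parsed document.
        return (Boolean) expr.evaluate(doc, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<!DOCTYPE person SYSTEM \"person.dtd\">"
                + "<person><name>Peter Pan</name><gender>Male</gender><age>30</age></person>";
        // Same condition as the jsoup sample: age >= 30
        System.out.println(matches(xml, "/person/age >= 30"));
        // A combined condition, changed without touching the Java code
        System.out.println(matches(xml, "/person/age >= 30 and /person/gender = 'Female'"));
    }
}
```

In the blob loop, the downloaded XML text would be passed to matches in place of the jsoup parsing and the hard-coded age check, so the condition can come from configuration or a request parameter.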

Peter Pan