I created a simple sample that reads XML files stored in Azure Blob Storage, parses them, and filters them by a condition to output a list of blob URLs. The sample uses Azure Storage SDK v8.0.0 for Java and jsoup, an HTML parser for Java.
Here are the dependencies of my Maven project.
<!-- https://mvnrepository.com/artifact/com.microsoft.azure/azure-storage -->
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>azure-storage</artifactId>
    <version>8.0.0</version>
</dependency>
<dependency>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
The XML content I used in my project looks like the following; there are 6 such files for testing.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE person SYSTEM "person.dtd">
<person>
    <name>Peter Pan</name>
    <gender>Male</gender>
    <age>30</age>
</person>
Here is the code.
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;
import java.security.InvalidKeyException;
import java.sql.Date;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.StorageException;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;
import com.microsoft.azure.storage.blob.ListBlobItem;
import com.microsoft.azure.storage.blob.SharedAccessBlobPermissions;
import com.microsoft.azure.storage.blob.SharedAccessBlobPolicy;
public class FilterXMLFiles {
    private static final String storageConnectionString = "<your storage account connection string>";
    private static final String containerName = "xmls"; // It's my container to store these XML files.
    private static CloudBlobClient serviceClient;

    public static void main(String[] args) throws InvalidKeyException, URISyntaxException, StorageException, MalformedURLException, IOException {
        CloudStorageAccount account = CloudStorageAccount.parse(storageConnectionString);
        serviceClient = account.createCloudBlobClient();
        CloudBlobContainer container = serviceClient.getContainerReference(containerName);
        // Generate a SAS token for reading XML files in the container.
        // Only READ is granted, since the token is used just to download blobs.
        SharedAccessBlobPolicy policy = new SharedAccessBlobPolicy();
        policy.setPermissions(EnumSet.of(SharedAccessBlobPermissions.READ));
        policy.setSharedAccessStartTime(Date.valueOf(LocalDate.now().minusYears(2)));
        policy.setSharedAccessExpiryTime(Date.valueOf(LocalDate.now().plusYears(2)));
        String token = container.generateSharedAccessSignature(policy, null);
        // Get the list of blobs in the container.
        Iterator<ListBlobItem> blobs = container.listBlobs().iterator();
        // Create a List object to store the filtered URLs.
        List<String> blobUrls = new ArrayList<>();
        while (blobs.hasNext()) {
            // Get the blob URL with the SAS token appended.
            String uri = blobs.next().getUri().toString();
            String urlWithSAS = String.format("%s?%s", uri, token);
            // System.out.println(urlWithSAS);
            // Fetch and parse with jsoup (30-second timeout), then filter by the condition age >= 30.
            Document root = Jsoup.parse(new URL(urlWithSAS), 30 * 1000);
            int age = Integer.parseInt(root.selectFirst("age").text());
            if (age >= 30) { // The filter condition: age >= 30
                blobUrls.add(uri);
                // blobUrls.add(urlWithSAS);
            }
        }
        System.out.println(String.join("\n", blobUrls));
    }
}
The result looks like this:
https://<my account name>.blob.core.windows.net/xmls/p1.xml
https://<my account name>.blob.core.windows.net/xmls/p3.xml
https://<my account name>.blob.core.windows.net/xmls/p5.xml
The sample is deliberately simple, just to explain my idea. Of course, in a real application where the filter query needs to be flexible, I think using XQuery (which plays a role for XML similar to SQL) is a better solution: for example, use Saxon (a third-party Java library) instead of jsoup, and express the filter condition as an XQuery expression. For more details about XQuery, you can refer to an XQuery tutorial and the Saxon documentation.
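To illustrate the idea of passing the condition as a query expression rather than hard-coding it, here is a minimal sketch. It is not Saxon and not XQuery: as a stand-in it uses the XPath 1.0 API that ships with the JDK, so it runs without extra dependencies. The class name `XPathFilterSketch` and the methods `matches` and `filter` are hypothetical names made up for this example, and the XML is passed in as strings keyed by blob URL, on the assumption that you have already downloaded the blob text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: filters XML documents by a boolean XPath 1.0 expression.
final class XPathFilterSketch {

    // Returns true when the boolean XPath expression holds for the given XML text.
    static boolean matches(String xml, String condition) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The sample files declare an external DTD (person.dtd). Resolve every
            // external entity to empty content so the parser does not try to load it.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Boolean) xpath.evaluate(condition, doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            // Wrap parser and XPath errors so callers deal with one unchecked exception type.
            throw new IllegalArgumentException("Cannot evaluate condition: " + condition, e);
        }
    }

    // Keeps the keys (for example, blob URLs) whose XML content satisfies the condition.
    // Assumption: the map already holds the downloaded XML text for each URL.
    static List<String> filter(Map<String, String> xmlByUrl, String condition) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> entry : xmlByUrl.entrySet()) {
            if (matches(entry.getValue(), condition)) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

With this sketch, the condition from my sample becomes the string `"/person/age >= 30"`, and a different filter such as `"/person/gender = 'Male'"` needs no code change. With Saxon the shape would be similar, but the expression language would be XQuery and the API calls would be Saxon's own.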