29

I am using Microsoft.WindowsAzure.StorageClient to manipulate blobs in Azure storage. I have reached the point where the user needs to list the uploaded files and modify/delete them. Since there are many files in one container, what is the best way to query the Azure storage service so that it returns only the desired files? I would also like to return only a specific number of blobs so I can implement paging.

There is a method called ListBlobs on CloudBlobContainer, but it seems to return all of the blobs in the container. That will not work for me.

I searched a lot on this topic and could not find anything useful. This link shows only the basics.

--------- EDIT

My answer below does not retrieve the blobs lazily, but it retrieves all of the blobs in the container and then filters the result. Currently there is no solution for retrieving blobs lazily.

Gorgi Rankovski

8 Answers

34

The ListBlobs method retrieves the blobs in the container lazily, so you can write queries against its result that are not executed until you iterate over the list (or materialize it with ToList or a similar method).

Things will get clearer with a few examples. For those who don't know how to obtain a reference to a container in an Azure Storage account, I recommend this tutorial.

Order by last modified date and take page number 2 (10 blobs per page):

blobContainer.ListBlobs().OfType<CloudBlob>()
    .OrderByDescending(b => b.Properties.LastModified)
    .Skip(10).Take(10);

Get a specific type of file. This will work if you have set the ContentType at upload time (which I strongly recommend you do):

blobContainer.ListBlobs().OfType<CloudBlob>()
    .Where(b => b.Properties.ContentType.StartsWith("image"));

Get .jpg files and order them by file size, assuming you set file names with their extensions:

blobContainer.ListBlobs().OfType<CloudBlob>()
    .Where(b => b.Name.EndsWith(".jpg"))
    .OrderByDescending(b => b.Properties.Length);

Finally, the query will not be executed until you tell it to:

var blobs = blobContainer.ListBlobs().OfType<CloudBlob>()
    .Where(b => b.Properties.ContentType.StartsWith("image"));

foreach(var b in blobs) //This line will call the service, 
                        //execute the query against it and 
                        //return the desired files
{
   // do something with each file. Variable b is of type CloudBlob
}
Gorgi Rankovski
  • I've tested your code Gorgi (first example) and it still retrieves multiple items when I use .Skip(1).Take(1). so it doesn't seem to lazy load – GeertvdC Jul 18 '13 at 12:11
  • Ok, I'll double check it later and I'll let you know – Gorgi Rankovski Jul 18 '13 at 15:26
  • @GeertvdC I just tried it, works as expected. Can you paste what you have tried? – Gorgi Rankovski Jul 18 '13 at 16:44
  • I've created a sample which does exactly the same as your code and uploaded here: http://sdrv.ms/12NmaIr it will skip 1 and take 1 from the total list of blobs (which is 2). but it does not lazy load when you check with fiddler it still lists 2 files in the response – GeertvdC Jul 19 '13 at 07:01
  • 3
    Holy s***, you are right. The documentation for ListBlobs() says that it retrieves the blobs lazily, but it looks like you can't write queries on the properties :\ The only thing you can query on is the name prefix of the blobs - for example, ListBlobs("test") - this returns only one file. – Gorgi Rankovski Jul 19 '13 at 08:14
  • I think it should be possible to do something with odata but haven't figured out how to do it. – GeertvdC Jul 19 '13 at 08:52
  • 5
    Hello Gorgi, blobContainer.ListBlobs() returns an IEnumerable, not an IQueryable. In .NET, by convention, an IEnumerable first loads all data from the server into memory and then queries it, as opposed to an IQueryable, which passes the query to the server and returns only the matching results, all lazily. http://www.codeproject.com/Tips/468215/Difference-Between-IEnumerable-and-IQueryable – James Roeiter Aug 24 '13 at 00:10
  • Just because the container retrieves blobs lazily doesn't mean that it's using any kind of advanced search based on your compiled LINQ query. Most likely, it's just moving from blob to blob, running your query after retrieving each one. – NathanAldenSr Nov 13 '13 at 15:37
  • Lazily means it will load the contents the first time it is used (not when ListBlobs is called). This does not mean it will list only the part you've asked for in your skip/take: it will load all blobs and then apply skip/take. A smarter skip/take requires the IQueryable interface – cellik Dec 17 '15 at 09:00
  • You can also filter by file type https://dev.to/williamxifaras/querying-azure-blob-storage-by-file-type-2boh – William Xifaras Apr 05 '19 at 02:02
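To make the IEnumerable point from the comments concrete, here is a minimal Python sketch. FakeContainer and its counter are hypothetical stand-ins, not the Azure SDK: the listing call returns a whole segment of blobs, and the Skip/Take-style filtering happens client-side, after the data has already been transferred.

```python
from itertools import islice

class FakeContainer:
    """Hypothetical stand-in for a blob container; not the Azure SDK."""

    def __init__(self, blob_names):
        self._blob_names = blob_names
        self.items_returned_by_service = 0  # what actually crossed the wire

    def list_blobs(self):
        # One listing request returns a whole segment of blobs; any
        # filtering happens client-side, after the data is transferred.
        response = list(self._blob_names)
        self.items_returned_by_service += len(response)
        yield from response

container = FakeContainer([f"blob{i}.jpg" for i in range(100)])

# The client-side equivalent of .Skip(10).Take(10) over an IEnumerable:
page = list(islice(container.list_blobs(), 10, 20))

assert page == [f"blob{i}.jpg" for i in range(10, 20)]
assert container.items_returned_by_service == 100  # all 100 still transferred
```

The caller only keeps 10 items, but the service still handed back the full listing, which is what the Fiddler trace in the comments showed.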
21

What I've realized about Windows Azure blob storage is that it is bare-bones. As in extremely bare-bones. You should use it only to store documents and associated metadata and then retrieve individual blobs by ID.

I recently migrated an application from MongoDB to Windows Azure blob storage. Coming from MongoDB, I was expecting a bunch of different efficient ways to retrieve documents. After migrating, I now rely on a traditional RDBMS and ElasticSearch to store blob information in a more searchable way.
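A minimal sketch of that secondary-index pattern, with a plain dict standing in for the RDBMS/ElasticSearch side (all names here are hypothetical): metadata is written to the index alongside each upload, all searching happens against the index, and blob storage is only ever hit by ID.

```python
blob_store = {}      # blob_id -> raw bytes (stands in for Azure blob storage)
metadata_index = {}  # blob_id -> searchable metadata (stands in for RDBMS/ES)

def upload(blob_id, data, **metadata):
    # Write the blob and its searchable metadata together.
    blob_store[blob_id] = data
    metadata_index[blob_id] = metadata

def search(**criteria):
    # All querying happens against the index, never against blob storage.
    return [
        bid for bid, meta in metadata_index.items()
        if all(meta.get(k) == v for k, v in criteria.items())
    ]

upload("doc-1", b"pdf bytes", owner="alice", kind="invoice")
upload("doc-2", b"pdf bytes", owner="bob", kind="invoice")

assert search(kind="invoice", owner="alice") == ["doc-1"]
assert blob_store["doc-1"] == b"pdf bytes"  # blob fetched by ID only
```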

It's really too bad that Windows Azure blob storage is so limited. I hope to see much-enhanced search capabilities in the future (e.g., search by metadata, properties, blob name regex, etc.). Additionally, indexes based on map/reduce would be awesome. Microsoft has a chance to convert a lot of folks from other document storage systems if it delivers these things.

NathanAldenSr
4

Edit

Now in preview is blob index for Azure Storage, a managed index of metadata that you can add to your blobs (new or existing). This removes the need to use creative container names for pseudo-indexing or to maintain a secondary index yourself.

Original answer

For returning specific results, one option is to use the blob and/or container name prefix to effectively index what you're storing. For example, you could prefix a date and time as you add blobs, or prefix a user name; how you'd want to "index" your blobs depends on your use case. You can then pass this prefix, or part of it, to the ListBlobs[Segmented] call to return specific results. Obviously you'd need to put the most general elements first, then the more specific ones, e.g.:

2016_03_15_10_15_blobname

This would allow you to get all 2016 blobs, or all March 2016 blobs, etc., but not the March blobs of every year without multiple calls.
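A small sketch of this naming scheme (the helper names are hypothetical, and list_blobs_with_prefix stands in for the service's prefix matching):

```python
from datetime import datetime

def make_blob_name(uploaded, original_name):
    # Most general part first (year), then month, day, hour, minute,
    # so that a name prefix acts as a crude index.
    return uploaded.strftime("%Y_%m_%d_%H_%M_") + original_name

def list_blobs_with_prefix(blob_names, prefix):
    # Stand-in for ListBlobs(prefix): the service matches on name prefix.
    return [n for n in blob_names if n.startswith(prefix)]

blobs = [
    make_blob_name(datetime(2016, 3, 15, 10, 15), "report.pdf"),
    make_blob_name(datetime(2016, 3, 16, 9, 0), "photo.jpg"),
    make_blob_name(datetime(2017, 1, 2, 8, 30), "notes.txt"),
]

assert list_blobs_with_prefix(blobs, "2016_") == blobs[:2]       # all 2016 blobs
assert list_blobs_with_prefix(blobs, "2016_03_15") == blobs[:1]  # a single day
```

Note that a query like "all March blobs regardless of year" cannot be expressed as a single prefix, which is the limitation described above.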

The downside is that if you needed to re-index blobs you'd have to delete and recreate them under a new name.

For paging you can generally use the ListBlobsSegmented method, which gives you a continuation token you can use to implement paging. That said, it's not much use if you need to skip pages, as it only works by continuing from where the last set of results left off. One option is to calculate the number of pages you need to skip, fetch and discard them, then fetch the page you actually want. If you have a lot of blobs in each container this can get pretty inefficient pretty quickly.
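The skip-and-discard approach can be sketched like this (Python, with list_blobs_segmented as a hypothetical stand-in for the real ListBlobsSegmented call and a plain integer standing in for the continuation token):

```python
def list_blobs_segmented(all_blobs, page_size, token=0):
    # Stand-in for ListBlobsSegmented: returns one page of results plus a
    # continuation token (None when there are no more results).
    page = all_blobs[token:token + page_size]
    next_token = token + page_size if token + page_size < len(all_blobs) else None
    return page, next_token

def get_page(all_blobs, page_size, page_number):
    # To reach page N you must walk (and discard) the N-1 pages before it;
    # the token only lets you continue from where the last call stopped.
    token = 0
    for _ in range(page_number):
        _discarded, token = list_blobs_segmented(all_blobs, page_size, token)
        if token is None:
            return []  # ran out of blobs before reaching the requested page
    page, _ = list_blobs_segmented(all_blobs, page_size, token)
    return page

blobs = [f"blob{i}" for i in range(25)]
assert get_page(blobs, 10, 0) == [f"blob{i}" for i in range(10)]
assert get_page(blobs, 10, 2) == [f"blob{i}" for i in range(20, 25)]
```

The discarded calls are exactly the inefficiency described above: jumping to page N costs N service calls.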

You could also use this as a fallback method: take a page-by-page approach and store the continuation token while the user clicks from one page to the next sequentially, or potentially cache the blob names and do your own paging from those.

You can also combine these two approaches, e.g. filtering by your "index" then paging on the results.

Matt
  • do you know of any information about how fast it is to query by prefix? is it indexed or does it scan through the whole container? – Andy Apr 11 '20 at 07:27
  • @Andy have edited the answer with some helpful links. – Matt Apr 11 '20 at 09:23
  • all these docs talk specifically about table storage. Have you found anything similar for blob storage? Specifically, if I have a million blobs in the container and I want to locate 10 of them by a prefix query, will this require a full scan? – Andy Apr 14 '20 at 07:34
  • Sorry I didn't even re-read my original answer before editing it, I thought we were talking table storage! The only things I'm aware of for blob storage performance are 1) be careful how you [partition](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist#partitioning) and design container/blob names or 2) pay for [high performance storage](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-performance-tiers). A third option would be to index container or blob names separately in your data-store of choice. – Matt Apr 14 '20 at 20:04
  • @Andy - Check out my edit, a new managed index feature has just been announced. – Matt May 06 '20 at 05:43
1

I stumbled here looking for other options besides comparing tags, and at least with the REST API it seems like this is now built in.

The Find Blobs by Tags operation finds all blobs in the storage account whose tags match a given search expression.

https://learn.microsoft.com/en-us/rest/api/storageservices/find-blobs-by-tags

You can add tags when uploading the blob.
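Conceptually, the operation matches blobs whose tags satisfy the search expression. A toy sketch of that matching in Python (this is not the REST API itself; all names here are hypothetical):

```python
def find_blobs_by_tags(blobs_with_tags, **wanted):
    # Conceptual stand-in for the Find Blobs by Tags operation: return
    # blob names whose tags contain all of the requested key/value pairs.
    return [
        name for name, tags in blobs_with_tags.items()
        if all(tags.get(k) == v for k, v in wanted.items())
    ]

blobs = {
    "invoice-01.pdf": {"project": "alpha", "status": "final"},
    "invoice-02.pdf": {"project": "alpha", "status": "draft"},
    "photo.jpg": {"project": "beta"},
}

assert find_blobs_by_tags(blobs, project="alpha") == ["invoice-01.pdf", "invoice-02.pdf"]
assert find_blobs_by_tags(blobs, project="alpha", status="final") == ["invoice-01.pdf"]
```

The real operation also spans containers and supports range comparisons in its filter expression; see the linked documentation for the exact syntax.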

Clocker
0

Azure Data Lake Storage Gen2 supports searching data stored in the data lake using U-SQL. The Blob storage APIs can be used to store and retrieve that data.

Jason Steele
0

U-SQL currently does not support interactive queries/search.

For my use case, I am planning to use Azure Blob storage for its low-cost advantage and, on each new blob file creation event, trigger an Azure Function to transform the blob and feed the processed output into Azure Cosmos DB or an RDBMS (which support queries).

0

Just configure the diagnostic settings on your resource to persist files to a Log Analytics workspace, and you should be good to go. I know this post is quite old, but it looks like it is still being indexed for similar questions.

(screenshot: Log Analytics)

cavok
0

Azure Blob Storage has in the meantime been enhanced to support index tags.

Quoting: Manage and find Azure Blob data with blob index tags

As datasets get larger, finding a specific object in a sea of data can be difficult. Blob index tags provide data management and discovery capabilities by using key-value index tag attributes. You can categorize and find objects within a single container or across all containers in your storage account. As data requirements change, objects can be dynamically categorized by updating their index tags. Objects can remain in-place with their current container organization.

Blob index tags let you:

  • Dynamically categorize your blobs using key-value index tags
  • Quickly find specific tagged blobs across an entire storage account
  • Specify conditional behaviors for blob APIs based on the evaluation of index tags
  • Use index tags for advanced controls on features like blob lifecycle management

Source code examples here:

Use blob index tags to manage and find data on Azure Blob Storage

Matthias Güntert