
GET /Product()?$filter=((PartitionKey%20eq%20'lIkfA81JpTmv')%20and%20(RowKey%20eq%20''))%20or%20((PartitionKey%20eq%20'lIGcEmrr7hWz')%20and%20(RowKey%20eq%20''))%20or%20((PartitionKey%20eq%20'lIAoy6PqeMVn')%20and%20(RowKey%20eq%20''))%20or%20((PartitionKey%20eq%20'lIjETAtuhYGM')%20and%20(RowKey%20eq%20''))%20or%20((PartitionKey%20eq%20'lIHa0znP5qAk')%20and%20(RowKey%20eq%20''))%20or%20((PartitionKey%20eq%20'lIOCaSXg9YE7')%20and%20(RowKey%20eq%20''))%20or%20((PartitionKey%20eq%20'lInRozGrMa7T')%20and%20(RowKey%20eq%20''))%20or%20((PartitionKey%20eq%20'lILEwwPPcBfe')%20and%20(RowKey%20eq%20''))%20or%20((PartitionKey%20eq%20'lJ14qZv1KFn4')%20and%20(RowKey%20eq%20''))%20or%20((PartitionKey%20eq%20'lIIohzupFLcV')%20and%20(RowKey%20eq%20'')).....

This is a very standard query to Azure Table Storage for a list of 50 known PartitionKey and RowKey pairs. It takes 5 seconds to get the first byte from the server. Is there any way to speed things up?

Eric Yin

2 Answers


"Or" queries are not optimized the way you might expect them to be. A query like this results in a full table scan. As Gaurav suggested, you really need to do these as separate queries (in parallel) to get fast response times.

I also thoroughly disagree with Astaykov's statement that you shouldn't bother optimizing because your performance is within the SLA. The performance isn't random, and an SLA is generally an upper bound. Do take the time to optimize your performance-sensitive queries. You should easily be able to do this sort of lookup consistently in sub-second time.

EDIT:

Not sure which language you're working in, but here's a quick Node.js test that seems to usually take between 1 and 1.2 seconds from my house, but occasionally closer to 1.5:

function timeParallelQueries(account, key) {
    var azure = require('azure'),
        Q = require('q'),
        _ = require('underscore');

    var tables = azure.createTableService(account, key);

    function convertToString(n) { return n + ''; }

    var start = null;

    // Create the test table, then insert 50 entities whose PartitionKey
    // and RowKey are both '0'..'49'.
    Q.ncall(tables.createTableIfNotExists, tables, 'test')
    .then(function () {
        return Q.all(_.map(_.map(_.range(50), convertToString), function(key) {
            return Q.ncall(tables.insertOrReplaceEntity, tables, 'test', {PartitionKey: key, RowKey: key});
        }));
    })
    // Time 50 point queries (one per PartitionKey/RowKey pair) issued in parallel.
    .then(function () {
        start = new Date();
        return Q.all(_.map(_.map(_.range(50), convertToString), function (key) {
            return Q.ncall(tables.queryEntity, tables, 'test', key, key);
        }));
    })
    .then(function (results) {
        // Log the results and the elapsed time in the same handler, so the
        // results aren't consumed by an earlier step in the chain.
        console.log(results);
        console.log('Took ' + (new Date() - start) + 'ms.');
    })
    .fail(console.error);
}
user94559
  • I want to get 50 items by PartitionKey and RowKey pair. Do you suggest I issue 50 parallel HTTP calls? Or is there any way to get 50 items in one call without triggering a full table scan? – Eric Yin Aug 15 '12 at 18:24
  • @EricYin - given the way your data is arranged, it's not possible to get them all in one call without a table scan. If you rearranged your keys such that the items you wanted were all in one partition, you could do it with a partition scan or possibly a range scan. But the feasibility of that depends on your data. – Brian Reischl Aug 15 '12 at 20:13
  • @breischl is right. Yes, I'm suggesting 50 parallel HTTP calls. – user94559 Aug 15 '12 at 20:41
  • @smarx thanks. On my local box, 50 calls are much slower than 1 (10M ADSL). I will try in the cloud tomorrow and see. – Eric Yin Aug 15 '12 at 20:57
  • @EricYin, how are you writing your parallel code? Are you using async methods or threading? Is your default connection limit high enough that you're actually getting parallel calls? – user94559 Aug 15 '12 at 21:49
  • @smarx, I am using `Parallel.ForEach`, very simple code. Where do I set connection limits? I am using .NET 4 – Eric Yin Aug 15 '12 at 22:04
  • @EricYin I just edited my answer to share some Node.js code, but I guess that won't help much. `ServicePointManager.DefaultConnectionLimit` is the setting I was talking about. I think it defaults to 2, so you'll only ever be making 2 calls in parallel. Crank it up and see if that helps, but you may also be at the mercy of `Parallel.ForEach`'s logic about how many concurrent tasks to execute. – user94559 Aug 15 '12 at 22:09
  • @smarx, I will check the code; first I will try editing the connection limit. With Parallel, I had set `MaxDegreeOfParallelism` to 50, but later removed it because I like to let the program use whatever resources are available. – Eric Yin Aug 15 '12 at 22:13
  • Begin/End uses completion ports, while Tasks attempts to create a thread for each request. Begin/End is faster by an order of magnitude or more. – Brian Reischl Aug 15 '12 at 22:14
  • There's a start at a good method of doing parallel queries here: http://stackoverflow.com/q/4535740/76263 It has some minor issues, but basically works. – Brian Reischl Aug 15 '12 at 22:20
1

Besides querying for "known" PK & RK, you are providing a lot of them combined with OR. Keeping in mind that the chance of different partitions being spread across different physical servers is pretty high, I am not surprised by the results.

Also according to the Storage SLA, a table operation:

Must complete processing or return a continuation within 10 seconds

While an operation on a single entity (that is single pair PK & RK):

Must complete processing within 2 seconds

So 5 seconds is about average and within the SLA. Even if you speed up your query somehow, it will not be reliable, in the sense that the SLA for your query is "within 10 seconds". So any effort you put into optimizing your query may be wasted, since the time varies depending on lots of factors. What you achieve today in 3 seconds may take 8 seconds tomorrow and still be within the SLA.

I would not dig deeper into something that is within the SLA.

UPDATE 1

There are a number of other ways to minimize page load time. You can start thinking asynchronously! Send super clean, data-empty HTML to the client and load all your data on demand via AJAX after page load.

Also think about caching. You can cache (almost) any type of data that is to be presented to the user, with the trade-off of "data accuracy" vs. "loading speed". So you can cache, or even pre-cache, some data that is to be loaded. I think this would be a good choice for your scenario: since you know the PK & RK you are looking for, cache those entries and serve them from the cache instead of going to the table on every request. You can set either absolute expiration or sliding expiration depending on how likely your data is to change.
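A minimal in-memory sketch of that caching idea, in the same Node.js style as the other answer's code (the key format, TTL, and `fetch` fallback are illustrative assumptions, not part of either answer):

```javascript
// Tiny cache with absolute expiration: entries expire ttlMs after insertion.
function createCache(ttlMs) {
    var store = {};
    return {
        get: function (key) {
            var entry = store[key];
            // Treat missing or expired entries as cache misses.
            if (!entry || Date.now() > entry.expires) {
                delete store[key];
                return null;
            }
            return entry.value;
        },
        set: function (key, value) {
            store[key] = { value: value, expires: Date.now() + ttlMs };
        }
    };
}
```

In the question's scenario, the cache key could be the PartitionKey/RowKey pair, with the table queried only on a miss, so repeat page views skip Table Storage entirely.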

UPDATE 2

As Gaurav mentioned, you can try querying the tables in parallel and putting the results into the cache. However, the degree of parallelism depends on the number of cores you are operating on, so there is no point in doing parallel queries on a single core. Still, consider caching and client-side data binding!

astaykov
  • :( That's very sad. These days all web design guidance says your page should load in 2 seconds. Seems it's mission impossible with Azure. I already cut my queries to two per page, but on average I render my page in 5 seconds :( which is very slow – Eric Yin Aug 15 '12 at 09:30
  • Since all your queries are on a PartitionKey and RowKey combination (PartitionKey = something and RowKey = someotherthing), can you try firing these queries separately in parallel instead of doing an "OR", process the data in your application code, and see if that makes any difference? – Gaurav Mantri Aug 15 '12 at 10:35
  • To get 50 items and avoid a full table scan, in theory I need 50 HTTP calls in parallel!! I tried it; it's not a very good idea. And since there are millions of items in the DB, I cannot cache them; I will CDN the generated pages. The problem is that the first view takes very, very long. Is there any way to get 50 items in one call without triggering a full table scan? – Eric Yin Aug 15 '12 at 18:27