
I need to retrieve 50,000 entities from my Azure Storage Table and put the result into a List. Retrieving the entities does not take much time, but copying them from the Iterable into a List takes relatively long, around 10 seconds. How can I do this in less time?

The following code retrieves the entries and puts them into an ArrayList:

Iterable<T> items = table.execute(tableQuery);

ArrayList<T> result = new ArrayList<T>();
if (items != null) {
    // Iterating triggers the actual requests; the Iterable fetches the
    // results page by page (up to 1000 entities per page) on demand.
    for (T item : items) {
        result.add(item.getContents());
    }
}

Only 1000 entries are retrieved per request, but the Iterable handles this automatically, from my understanding. This also seems to be the time-consuming part: fetching the next 1000 entries each time.
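To check whether the time really goes into fetching the pages rather than into building the list, the bare iteration can be timed on its own. This is only an illustrative sketch; `table` and `tableQuery` are the same objects as in the snippet above, and no list is built at all:

    // Rough timing sketch (illustrative only): measure the bare iteration,
    // separate from the initial execute() call. Every page boundary the
    // iterator crosses triggers another request to the Table service.
    Iterable<T> items = table.execute(tableQuery);

    long start = System.nanoTime();
    int count = 0;
    for (T item : items) {
        count++;   // no list work, just the paged iteration
    }
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("Iterated " + count + " entities in " + elapsedMs + " ms");

If the bare iteration alone already takes around 10 seconds, the cost is in the round trips to the service rather than in building the List.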

I also tried this with executeSegmented and ResultContinuation tokens:

ArrayList<T> result = new ArrayList<T>();
ResultContinuation token = null;

do {
    // Each call fetches the next page (up to 1000 entities) and returns a
    // continuation token pointing to the following page, if there is one.
    ResultSegment<T> segment = table.executeSegmented(tableQuery, token);
    result.addAll(segment.getResults());
    token = segment.getContinuationToken();
} while (token != null);

Here each call to executeSegmented takes a lot of time.

So both of these options are too slow. How can I build this List faster?

Edit

The query is as following:

TableQuery<T> tableQuery = TableQuery.from(classAzure)
        .where(TableQuery.generateFilterCondition("MerchantId", QueryComparisons.EQUAL, merchantId));
Frank
  • Have you profiled this to see where the majority of the time is spent? My guess is that when retrieving 50,000 entities, the majority of the time is going to be spent in retrieval time. Here's a great link on using Stopwatch to get execution time: http://stackoverflow.com/questions/14019510/calculate-the-execution-time-of-a-method – Rob Reagan Mar 01 '17 at 14:58
  • Can you share the query you're using to fetch the data? – Gaurav Mantri Mar 01 '17 at 15:17
  • @RobReagan the `execute` part is fast, but the iterator is relatively slow (so the for loop). I think it retrieves 1000 entries, processes them, and then retrieves another 1000, which takes a lot of time. Just a guess though. @GauravMantri I added the query to the original post. – Frank Mar 01 '17 at 15:50

2 Answers

3

So there are two things going on here:

  1. Query is not optimized: I noticed that you're querying on an attribute called MerchantId. Since your query does not include PartitionKey, Azure Table Service does a full table scan, i.e. it starts at the very first partition, finds the data matching your query, then moves on to the next partition, and so on. Depending on how many entities there are in your table, this can make query execution quite slow.

The Azure Storage team has published an excellent guide on table design. I would highly recommend that you read it. You can find it here: https://learn.microsoft.com/en-us/azure/storage/storage-table-design-guide.

  2. Lazy Iterator: I have not used Java in ages, so I may be wrong here, but in C#, when you execute this line of code:

    Iterable<T> items = table.execute(tableQuery);

The query doesn't get executed at that point; it only gets executed when you actually iterate over the result.

To address the slowness, I would recommend taking another look at the query and seeing whether you can include PartitionKey in it, for example as sketched below. You can also trace the requests with a tool like Fiddler to see how many requests are being made to Azure Table Service.
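As a rough sketch (not tested, and assuming the partition key of a merchant's entities can be derived from the merchant id; `partitionKeyForMerchant` below is a hypothetical value), the filter could target a single partition and still check MerchantId:

    // Hypothetical sketch: restrict the query to one partition so the Table
    // service does not scan every partition. partitionKeyForMerchant is an
    // assumed value; how PartitionKey maps to merchants depends on your table design.
    String partitionFilter = TableQuery.generateFilterCondition(
            "PartitionKey", QueryComparisons.EQUAL, partitionKeyForMerchant);
    String merchantFilter = TableQuery.generateFilterCondition(
            "MerchantId", QueryComparisons.EQUAL, merchantId);

    TableQuery<T> tableQuery = TableQuery.from(classAzure)
            .where(TableQuery.combineFilters(partitionFilter, TableQuery.Operators.AND, merchantFilter));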

Gaurav Mantri
  • I don't think it is related to the lazy iterator, since the code calls result.addAll(segment.getResults()). The problem here is the unoptimized query, since Frank didn't include PartitionKey in his query. – Zhaoxing Lu Mar 01 '17 at 23:44
  • Thank you for your answer. I played a bit with the query but it does not seem to get any faster. The scenario is that the storage table currently contains 50,002 entries; the first two are in partition 1 and the rest in partition 2 (based on their merchantId). I changed the query to check the `PartitionKey` instead of the `MerchantId`, but it does not improve performance. I could also not make performance improvements by changing the table design. I'm afraid I will need to accept the current performance. – Frank Mar 02 '17 at 10:06
  • One more thing to realize is that a single request to the table service returns a maximum of 1000 entities. So if you want to fetch 50,000 entities, your code is making at least 50 requests to the server. – Gaurav Mantri Mar 02 '17 at 10:34
0

By default, the ArrayList class resizes dynamically to accommodate newly added elements. Each resize allocates a larger backing array and copies the existing elements into it, which becomes expensive when it happens repeatedly while adding tens of thousands of items.

Try this: construct your ArrayList with an initial capacity before you begin adding elements. This avoids the repeated resizing. Take a reasonable guess at the capacity needed; ArrayList will simply fall back to its normal growth behavior if you add more elements than that.
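A minimal sketch of what that looks like, assuming roughly 50,000 entities are expected (the capacity is only a hint, not a hard limit; `items` is the Iterable from the question):

    // Presize the backing array so it is not repeatedly grown and copied
    // while ~50,000 results are added. The list can still grow beyond this
    // capacity if more elements arrive.
    ArrayList<T> result = new ArrayList<T>(50000);
    for (T item : items) {
        result.add(item);
    }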

Rob Reagan
  • When I remove the `result.add` call from the for loop, the loop still takes a long time. The issue really is in the iterator, so this will unfortunately not help. – Frank Mar 02 '17 at 10:01