Ideas on concurrent data structures

Question

I am not sure I can put my question in the clearest fashion, but I will try my best.

Let's say I am retrieving some information from a third-party API. The retrieved information will be huge. To gain performance, instead of retrieving everything in one go, I will retrieve it page by page (the API gives me that facility, basically an iterator). Each call returns a list of objects.

My aim is to process the page I already have in hand (comparing, storing in the DB, and many other operations) while the request for the next page is still in flight.

My question to the expert community is: what data structure would you prefer in such a case? Also, does a framework like Spring Batch help with performance in such cases?

I know the question is a bit vague, but I am looking for general ideas, tips, and pointers.

The question is indeed a bit vague. When you ask about a data structure preference, are you referring to the data structure returned by the 3rd party API? Usually we don't get a choice in those matters :( Frameworks aren't generally used for performance gains. Usually, developers use them for extensibility (meaning we write less boilerplate code and don't have to rewrite functionality that already exists) — rurouni88, Aug 12 '14 at 02:45
@thePoly_glot Keep in mind that there's a good chance that hitting the API will be the most expensive part of the call. It's often the best idea to get all of your data in one chunk. Have you done profiling that suggests otherwise? — Patrick Collins, Aug 12 '14 at 02:54
@Patrick Collins Yeah. That was the first thing I did to find where the bottleneck is. Yes, fetching info from the API is the most expensive part. So I figured, why not do something useful while the API sends me data? — thePoly_glot, Aug 12 '14 at 03:01
I think I have to vote "too vague" also. Concurrency here would be about your processing, not the fact that IO is slow and you read data in chunks. So we need to know all tasks on your end, and probably all data too before we could say anything useful. — markspace, Aug 12 '14 at 03:11
Given that reading the data takes most of the time, try using the [AsyncIterator](http://stackoverflow.com/questions/21143996/asynchronous-iterator). I have a slightly updated/modified version from the answer if you are interested. — vanOekel, Aug 15 '14 at 11:36

score 0 · Answer 1 · answered Aug 12 '14 at 03:01

In terms of doing the actual parallelism, one very useful construct in Java is the ThreadPoolExecutor. A rough sketch of what that might look like is this:

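import java.util.List;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class YourApp {
    // static, so it can be instantiated from the static main() method
    static class Processor implements Runnable {
        private final Widget toProcess;

        Processor(Widget toProcess) {
            this.toProcess = toProcess;
        }

        @Override
        public void run() {
            // commit the Widget to the DB, etc
        }
    }

    public static void main(String[] args) {
        // core size == max size: with an unbounded queue, the pool
        // never grows beyond its core size, so (1, 10) would run one thread
        ThreadPoolExecutor executor =
            new ThreadPoolExecutor(10, 10, 30,
                                   TimeUnit.SECONDS,
                                   new LinkedBlockingDeque<Runnable>());

        while (thereAreStillWidgets()) {
            // the expensive fetch of the next batch of widgets
            List<Widget> widgets = doExpensiveDatabaseCall();
            for (Widget widget : widgets) {
                executor.execute(new Processor(widget));
            }
        }

        // let queued tasks finish, then release the worker threads
        executor.shutdown();
    }
}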

But as I said in a comment: calls to an external API are expensive. It's very likely that the best strategy is to pull all the Widget objects down from the API in one call, and then process them in parallel once you've got them. Doing more API calls gives you the overhead of sending the data all the way from the server to you, every time -- it's probably best to pay that cost the fewest number of times that you can.
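The "fetch once, then process in parallel" approach could be sketched roughly as follows. This is only an illustration: `Widget`, `fetchAll`, and `process` are hypothetical placeholders for the real API call and the real per-item work, and the pool size is an arbitrary choice. It relies on `ExecutorService.invokeAll`, which blocks until every submitted task has finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FetchThenProcess {
    // Hypothetical payload type; stands in for whatever the API returns.
    static final class Widget {
        final int id;
        Widget(int id) { this.id = id; }
    }

    // Placeholder for the single bulk API call (assumes the API can return everything at once).
    static List<Widget> fetchAll(int count) {
        List<Widget> widgets = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            widgets.add(new Widget(i));
        }
        return widgets;
    }

    // Placeholder for the per-widget work (compare, store in DB, ...).
    static int process(Widget w) {
        return w.id * 2;
    }

    // Processes every widget on a fixed-size pool and returns the results in input order.
    static List<Integer> processAll(List<Widget> widgets, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (Widget w : widgets) {
                tasks.add(() -> process(w));
            }
            List<Integer> results = new ArrayList<>();
            // invokeAll blocks until all tasks are done; futures come back in task order.
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                results.add(f.get()); // rethrows a task failure as ExecutionException
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processAll(fetchAll(5), 4)); // prints [0, 2, 4, 6, 8]
    }
}
```

Whether this beats paging depends on whether the whole result set fits comfortably in memory, which only measurement on the real data can tell.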

Also, keep in mind that if you're doing DB operations, it's possible that your DB doesn't allow for parallel writes, so you might get a slowdown there.

score 0 · Accepted Answer · answered Aug 15 '14 at 09:43

In these cases, the data structure for me is java.util.concurrent.CompletionService.

For purposes of example, I'm going to assume a couple of additional constraints:

You want only one outstanding request to the remote server at a time
You want to process the results in order.

Here goes:

// a class that knows how to update the DB given a page of results
class DatabaseUpdater implements Callable<Object> { ... }
// a background thread to do the work
final CompletionService<Object> exec = new ExecutorCompletionService<Object>(
   Executors.newSingleThreadExecutor());

// first call
List<Object> results = ThirdPartyAPI.getPage( ... );
// Start loading those results to DB on background thread
exec.submit(new DatabaseUpdater(results));

while( you need to ) {
  // Another call to remote service (reuse the variable; redeclaring it here would not compile)
  results = ThirdPartyAPI.getPage( ... );
  // wait for existing work to complete; get() rethrows any failure as ExecutionException
  exec.take().get();
  // send more work to background thread
  exec.submit(new DatabaseUpdater(results));
}
// wait for the last task to complete
exec.take().get();

This is just a simple two-thread design. The first thread is responsible for getting data from the remote service and the second is responsible for writing to the database.

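Any exception thrown by DatabaseUpdater is captured in the task's Future. exec.take() only returns that completed Future; calling get() on it (exec.take().get()) rethrows the failure on the main thread, wrapped in an ExecutionException.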
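For anyone who wants to try the pattern end to end, here is a small self-contained sketch of the same two-thread pipeline. The names are all assumptions made for the demo: `getPage` fakes the third-party API and returns an empty list when there are no more pages, `updater` fakes the DB write, and `failOnPage` exists only to show how a failure on the worker thread reaches the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PagedPipeline {
    // Hypothetical stand-in for the third-party API: page n, or an empty list when done.
    static List<String> getPage(int n, int totalPages) {
        if (n >= totalPages) {
            return new ArrayList<>();
        }
        return Arrays.asList("item-" + n + "-a", "item-" + n + "-b");
    }

    // Hypothetical stand-in for the DB write; fails on one chosen page to demo error handling.
    static Callable<Integer> updater(List<String> stored, List<String> page,
                                     int pageNo, int failOnPage) {
        return () -> {
            if (pageNo == failOnPage) {
                throw new IllegalStateException("simulated DB failure on page " + pageNo);
            }
            stored.addAll(page);
            return page.size();
        };
    }

    // Fetches pages on the calling thread while one worker stores the previous page.
    static List<String> run(int totalPages, int failOnPage)
            throws InterruptedException, ExecutionException {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CompletionService<Integer> exec = new ExecutorCompletionService<>(worker);
        List<String> stored = new ArrayList<>(); // written only by the single worker thread
        try {
            int pageNo = 0;
            List<String> page = getPage(pageNo, totalPages);
            if (page.isEmpty()) {
                return stored;
            }
            exec.submit(updater(stored, page, pageNo, failOnPage));
            while (true) {
                pageNo++;
                // This fetch overlaps with the worker storing the previous page.
                List<String> next = getPage(pageNo, totalPages);
                // Wait for the previous page; get() rethrows a worker failure.
                exec.take().get();
                if (next.isEmpty()) {
                    break;
                }
                exec.submit(updater(stored, next, pageNo, failOnPage));
            }
            return stored;
        } finally {
            worker.shutdown(); // otherwise the non-daemon worker keeps the JVM alive
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(3, -1).size() + " items stored");
        try {
            run(3, 1);
        } catch (ExecutionException e) {
            System.out.println("failed: " + e.getCause().getMessage());
        }
    }
}
```

Keeping a reference to the executor, rather than creating it inline, is what makes the `shutdown()` call in the `finally` block possible.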

Good luck.
