
I have a for loop that loops about 1 billion times. There are many database queries and computations within each iteration. The simplified pseudocode looks like this:

for(int i = 0; i < 1000000000; i++){
    query();
    if(...){
        compute();
    }  
}

If I can set up and run multiple threads in parallel, so each iterates millions of times, that would significantly reduce the time.

Without some kind of parallel processing, it would take months to finish. Is it possible to reduce the time by implementing threads in this situation? I'm aware of the new streams feature in Java 8, but upgrading to Java 8 is not an option for me.

If there's an easy-to-follow guide somewhere, that would be great too! Thanks in advance.

edit: here's more detailed code. I'm potentially checking the database multiple times for each insertion, and I have to process the data before doing so. Ideally I want multiple threads to share the workload.

for(int i = 1; i <= 100000000; i++){
    String pid = ns.findPId(i); //query
    Object g = findObject(pid); //query
    if(g != null){
        if(g.getSomeProperty() != null && g.getSomeProperty().matches(EL)){
            int isMatch = checkMatch(pid); //query
            if(isMatch == 0){
                String sampleId = findSampleId(pid); //query
                if(sampleId != null){
                    Object temp = ns.findMoreProperties(sampleId); //query
                    if(temp != null){
                        g.setSomeAttribute(temp.getSomeAttribute());
                        g.setSomeOtherProperty(temp.getSomeOtherProperty());
                        insertObject(g); //compute, encapsulate and insert into database table
                    }
                }
            }else{
                //log
            }
        }
    }
}
Andy
  • If you are doing database access, accessing the same resource, then you'll run into problems. At the very least, the number of open connections or the number of simultaneous queries you can run with one connection will be limited. And if you're reading or updating the same table, you may end up with a slower solution rather than a faster one. Perhaps you should switch to a distributed platform such as Spark. – RealSkeptic Dec 19 '18 at 09:45
  • Well, if the iterations are independent you could just split them up into smaller packages. One way to do that would be to create a bunch of `Runnable` or `Callable` instances for the tasks and submit them to a `ThreadPoolExecutor`. – Thomas Dec 19 '18 at 09:46
  • As Thomas said, if the operations are somewhat independent you can also use `ExecutorService`: https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html – Lorelorelore Dec 19 '18 at 09:47
  • Can show you an approach if you can give more details on query(), the if condition and compute(). – Ian Lim Dec 19 '18 at 09:54
  • @IanLim sure. It's posted – Andy Dec 19 '18 at 10:33
  • @RealSkeptic I left the computer on, and the program stopped iterating (the program itself didn't stop, it just got stuck) at around 100,000 records. I have tried a different approach - moving all the data into memory in one hit before calculating and inserting. Loading everything in one hit was fast, but it would run out of memory for more than 1 million records. The one-at-a-time approach didn't have the memory problem but would take forever. – Andy Dec 20 '18 at 01:41
  • @Andy what heap size did you get up to? You should be able to use a heap sized to 10s of GB. – tgdavies Dec 27 '18 at 23:03
  • @tgdavies I'm using 10GB already – Andy Dec 28 '18 at 07:15

2 Answers


1) Evaluate and see if you need a ThreadPoolExecutor:

ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(10);

2) Write a Callable for the first part

public class FindObjectCallable implements Callable<Object> {
    ...

    @Override
    public Object call() throws Exception {
        String pid = ns.findPId(i); //query
        return findObject(pid); //query
    }
}
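
The `...` above is the callable's own state. Going by the `new FindObjectCallable( ns, i )` call in step 3, it would carry the query helper and the loop index; a minimal sketch of those members (the field names and the `NsService` type are assumptions, not from the original code):

// inside FindObjectCallable - assumed fields and constructor
private final NsService ns; // placeholder type: whatever object exposes findPId()
private final int i;        // the loop index this task is responsible for

public FindObjectCallable(NsService ns, int i) {
    this.ns = ns;
    this.i = i;
}

Note that findObject(pid) would also have to be reachable from the callable, e.g. passed in the same way.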

3) Main code to do the following:

    ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(10);

    List<Future<Object>> futures = new ArrayList<Future<Object>>(0);    

    for(int i = 1; i<=100000000; i++) {
        FindObjectCallable callable = new FindObjectCallable( ns, i );
        Future<Object> result = executor.submit(callable);
        futures.add(result);
    }

    for( Future<Object> future: futures )
    {
        // process each result here - the Java 7 equivalent of a lambda (see the sketch below)
    }
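
The body of that last loop is where the rest of the question's logic goes, written out inline since Java 7 has no lambdas. A rough sketch of its shape (note that checkMatch() and findSampleId() also need the pid, so in practice the callable would probably have to return the pid together with the object):

    for( Future<Object> future: futures )
    {
        Object g = future.get(); // blocks until that task is done; get() throws if the task failed
        if( g != null )
        {
            // the getSomeProperty()/checkMatch()/findSampleId()/insertObject(g)
            // steps from the question go here, unchanged
        }
    }
    executor.shutdown(); // let the pool wind down once everything has been processed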
Ian Lim

Seems like what you need is something like the Parallel.For that exists in C#. This post addresses that issue with an example of someone who implemented their own Parallel.For in Java: Parallel.For implemented with Java
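
For reference, a bare-bones range-splitting helper along those lines can be written on plain Java 7 with an `ExecutorService`. This is only an illustrative sketch, not the code from the linked post, and the class and interface names are made up:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelFor {

    // the work to run for a single index
    public interface Body {
        void run(int i);
    }

    // Splits [start, end) into one chunk per thread and blocks until all chunks are done.
    public static void loop(int start, int end, int threads, final Body body)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Callable<Void>> chunks = new ArrayList<Callable<Void>>();
        int chunkSize = Math.max(1, (end - start + threads - 1) / threads); // ceiling division
        for (int from = start; from < end; from += chunkSize) {
            final int lo = from;
            final int hi = Math.min(end, from + chunkSize);
            chunks.add(new Callable<Void>() {
                @Override
                public Void call() {
                    for (int i = lo; i < hi; i++) {
                        body.run(i); // each thread works through its own slice of the range
                    }
                    return null;
                }
            });
        }
        pool.invokeAll(chunks); // waits for every chunk to complete
        pool.shutdown();
    }
}

In the question's case, body.run(i) would wrap the findPId/findObject/insert sequence for one index, which is exactly where the database caveats below come in.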

I wouldn't use the example Dang Nguyen suggested, because that just spins up a lot of threads with no locking, so there is no thread-safety or proper concurrency. There is a pretty big chance you would hit an exception thrown by the database when two threads try to write to the same field in the database at the same time.

Even with a parallel for loop, you still have a chance of running into concurrency problems in the database, I think, since two tasks running in parallel could still end up accessing the same database entity.